PA230: Reinforcement Learning Petr Novotn´y “Good and evil, reward and punishment, are the only motives to a rational creature; these are the spur and reins whereby all mankind are set on work and guided.” John Locke, Some Thoughts Concerning Education (1693) 1/294 Organizational Information General • Lecture: Thursdays 2-3:40p.m. • Homework: see the interactive syllabus in IS • mainly binary classification (accepted/not accepted) • all your homeworks need to be marked as passed to proceed to exam • can (but do not have to) be done in pairs (pairs can differ across the individual assignments) • for those who passed, the teacher will receive feedback on the general quality of the solutions for each student - can be taken into account when determining the final grade (typically in students’ favor) • Exam: • oral • each attempt counts ? (unlike the Br´azdil system) • in general, knowledge of anything mentioned on the slides can be required, unless explicitly marked with “nex” (like the Br´azdil system) 2/294 Team • Lecturer: Petr Novotn´y • HW team: Martin Kureˇcka V´aclav Nevyhoˇstˇen´y V´ıt Unˇcovsk´y 3/294 Communication Official discord server: https://discord.gg/9mxTgYhcdB • Official communication forum of the course: falls under the university ethical guidelines. • Use your real name for posting (you can set-up an account under your IS email if necessary). 4/294 Reading • Compulsory: • these slides, • material explicitly prescribed by these slides (not much). • Recommended: • Sutton & Barto: Reinforcement Learning: An Introduction (2nd ed.), available at http://incompleteideas.net/book/RLbook2020.pdf • henceforth referenced as “S&B” • slides by David Silver https://www.davidsilver.uk/teaching/ • CMU slides https://www.andrew.cmu.edu/course/10-703/ • more specific literature recommendations will be given for each topic later 5/294 Reinforcement Learning: What, Why, When, How, & Other Questions Types of machine learning • unsupervised • spot ”useful” patterns in data • supervised • given labeled data, predict labels on unlabeled data • reinforcement • agents and decision-making • agency = “the ability to take action or to choose what action to take” (Cambridge Dictionary) 6/294 General RL scheme source: Sutton&Barto, p. 48 Keywords: sequential, dynamic, subject to uncertainty 7/294 RL: Objective and approach • Objective: Design a decision policy (= agent behavior) which prescribes to the agent how to act in different situations (states), typically so as to achieve some goal. • Approach: Start with (± random) behavior and adapt it based on past experience via the law of effect: • actions with good/bad consequences for the agent are more/less likely to be repeated by the agent (within the same context) 8/294 RL in psychology (nex) I.P. Pavlov (1849-1936) classical conditioning E. Thorndike (1874-1949) law of effect J.B. Watson (1878-1958) behaviorist manifesto B.F. Skinner (1904-1990) radical behaviorism, reinforcement, rewards 9/294 Learning by trying & XX dilemma Underlying the RL approach is the idea of learning by trying: • first, act more or less randomly (exploration) • integral part of early human development • continually adapt behavior according to experience and feedback from the environment (exploitation) • strength of feedback ≈ strength of behavior adaptation Balancing exploration and exploitation (XX) is a recurring theme in RL. 10/294 Incomplete history of RL in computer science I “Learning by trying” machines and software, ad hoc approaches: A. 
Turing (1912-1954) 1948: theoretical “pleasure & pain” sys- tem to train computers C. Shannon (1916-2001) 1950: Theseus maze-solving mouse M. Minsky (1927-2016) 1950s: analog neural net machines (SNARCS) And many more. . . Recommended: S&B: Sec. 1.7. 11/294 Incomplete history of RL in computer science II Mathematical foundations of sequential decision making: R. Bellman (1920-1984) R. Howard (b. 1934) • Formalization via Markov decision processes (MDPs) • value iteration (attributed to Bellman, 1957) • policy iteration (attributed to Howard, 1960) 12/294 Incomplete history of RL in computer science III Since late 1980’s: synthesis – learning by trial in MDPs R. Sutton A. Barto Temporal difference learning C. Watkins Q-learning . . . and many more. 13/294 Successes of RL (nex) 14/294 Words of caution (and controversy) (nex) 15/294 Mathematical Foundations of Sequential Decision-Making MDP Example MDP with actions, rewards and transition probabilities. start next passed sleep study: -2 1 study: -2 1 done: 20 1 done: 0 1 procrast.: 4 1 leave: -1 1 2 1 2 procrast.: -1 1 pub: +4 0.3 0.4 0.3 16/294 Markov Decision Process Given a set X, we denote D(X) the set of all probability distributions over X. Definition 1 A Markov decision process (MDP) is a tuple (S, A, p, r) where • S is a set of states, • A is a set of actions, • p: S × A → D(S) is a probabilistic transition function, • r : S × A → R is a reward function. We will shorten p(s, a)(s′ ) to p(s′ | s, a). The p, r can be partial functions: action a is enabled in state s if both p(s, a) and r(s, a) are defined. We denote by A(s) the set of all actions enabled in s. 17/294 Dynamics of MDPs • start in some initial state s0 • MDP evolves in discrete time steps t = 0, 1, 2, 3, . . . • in each time step t, let st be the current state; then: • agent selects action at ∈ A(st ) • the environment responds with next state st+1 ∼ p(st , at ) and with immediate reward rt+1 = r(st , at ) • t is incremented and the process repeats in the same fashion forever Thus, the agent produces a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . .. τ is produced randomly (due to p and possibly also agent choices being probabilistic): it is a random variable and so are its components: we define random variables • St = state at time step t • At = action at time step t • Rt = reward received just before entering St 18/294 Policies Definition 2 A history is a finite prefix of a trajectory ending in a state, i.e., an object of type s0, a0, r1, s1, a1, r2, . . . , at−1, rt, st ∈ (S · A · R)∗ S. we denote by last(h) the last state of a history h. Definition 3 A policy is a function π: (S · A · R)∗ S → D(A) which to each history h assigns a probability distribution over A(last(h)). A policy is by definition an infinite object! 19/294 MD policies Definition 4 A policy π is: • memoryless if π(h) = π(h′ ) whenever last(h) = last(h′ ) (we can view memoryless policies as objects of type π: S → D(A)); • deterministic if π(h) always assigns probability 1 to one action, and zero to all others (we can view det. policies of objects of type π: (S · A · R)∗ S → A). Definition 5 A policy π is MD (memoryless deterministic) if it is both memoryless and deterministic. 20/294 Dynamics of MDPs (more precise) Given a distribution I of initial states and a policy π • start in some initial state s0 ∼ I • MDP evolves in discrete time steps t = 0, 1, 2, 3, . . . • in each time step t, let ht be the history produced so far; then: • agent selects action at ∈ A(st ) according to π, i.e. 
at ∼ π(ht ) • the environment responds with next state st+1 ∼ p(st , at ) and with immediate reward rt+1 = r(st , at ); the history is extended by at , rt+1, st+1 • t is incremented and the process repeats in the same fashion forever 21/294 Probability space induced by a policy In particular, each policy π together with a distribution I of initial states induces a probability measure Pπ over the trajectories of the MDP.1 We denote by Eπ the associated expected value (expectation) operator. We denote by Pπ [E | S0 = s] the probability of event E provided that the initial state is fixed to s (and similarly for expectations). Exercise 6 In the “study” MDP, consider an MD policy π s.t. π(start) = study and π(next) = pub. Compute the following quantities: • Pπ [visit pub at least twice | S0 = start] • Pπ [visit pub exactly twice | S0 = start] • Eπ [R1] • Eπ [R3] 1I is typically known from the context and hence omitted from the notation 22/294 Memorylessness In this course, we will almost exclusively focus on memoryless policies. Hence, from now on, policy = memoryless policy. General policies will be referred to as history-dependent policies should the need arise. Why memoryless? Intuition: the Markov property of MDPs: the next step depends only on the current state and on the action performed in the current step. Hence, intuitively, there is no need for a policy to remember the past in order to “play well”. The sufficiency of memoryless policies does not extend to more general/complex decision-making settings (not covered in this course), such as: • partially observable MDPs • non-stationary environments • quantile/risk-aware MDPs, etc. 23/294 Returns (payoffs) Definition 7 Let γ ∈ [0, 1) be a discount factor. For a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . we define the discounted return (or payoff) of τ to be the quantity G(τ) = r1 + γ·r2 + γ²·r3 + γ³·r4 + · · · = Σ_{i=0}^{∞} γ^i · ri+1. Equivalently, G = Σ_{i=0}^{∞} γ^i · Ri+1. 24/294 Returns (variants) • Finite horizon (FH): additionally, we are given a decision horizon H ∈ N ∪ {∞}. The return is counted only up to step H: GH = Σ_{i=0}^{H−1} γ^i · Ri+1 For finite H, the discount factor can be 1. H = ∞ corresponds to the original definition. • Episodic returns: In episodic tasks, there is a distinguished set Term ⊆ S of terminal states which is guaranteed to be reached with probability 1 under any policy. We denote by T a random variable denoting the first point in time when we hit a terminal state. We count rewards only up to that time: GT = Σ_{i=0}^{T−1} γ^i · Ri+1 Can be modeled under the original definition by “sink” states. 25/294 Types of returns: discussion • We will typically omit the superscripts since the type of task considered will be known from the context. • We have GH → G (pointwise) as H → ∞. I.e., finite-horizon returns with high enough H approximate the standard (infinite-horizon) case. • In the real world, we typically deal with FH or episodic tasks: we cannot wait an infinite time to learn something from a trajectory. However, the infinite-horizon case can be viewed as a neat mathematical abstraction of FH & episodic tasks, and the classical sequential decision-making theory is most developed for the infinite-horizon case. 26/294 Policy and state values Definition 8 Let π be a policy and s a state. The value of π in state s is the quantity vπ (s) = Eπ [G | S0 = s]. Exercise 9 Discuss the values of MD policies in our running example. Definition 10 The (optimal) value of state s is the quantity v∗ (s) = sup π vπ (s).
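To make the dynamics and Definition 7 concrete, here is a minimal Python sketch (not part of the original slides) that samples a trajectory from a memoryless policy and accumulates its discounted return; the toy MDP and all its numbers are illustrative inventions, not the exact “study” example.

```python
import random

# Illustrative toy MDP: p[s][a] = [(s', prob), ...], rew[s][a] = r(s, a);
# a state with no enabled actions is treated as terminal.
p = {"start": {"work": [("next", 0.5), ("start", 0.5)], "rest": [("start", 1.0)]},
     "next":  {"finish": [("done", 1.0)]},
     "done":  {}}
rew = {"start": {"work": -2.0, "rest": 1.0}, "next": {"finish": 20.0}}

def sample_episode(policy, s0="start", max_steps=100):
    """Roll out a memoryless policy: a_t ~ pi(s_t), s_{t+1} ~ p(. | s_t, a_t)."""
    traj, s = [], s0
    for _ in range(max_steps):
        if not p[s]:                                  # terminal state reached
            break
        a = policy(s)
        s_next = random.choices(*zip(*p[s][a]))[0]    # sample successor state
        traj.append((s, a, rew[s][a]))                # store (s_t, a_t, r_{t+1})
        s = s_next
    return traj

def discounted_return(traj, gamma=0.9):
    """G(tau) = sum_{i >= 0} gamma^i * r_{i+1}  (Definition 7)."""
    return sum(gamma ** i * r for i, (_, _, r) in enumerate(traj))

pi = {"start": "work", "next": "finish"}   # an MD policy as a lookup table
print(discounted_return(sample_episode(pi.get)))
```

Averaging `discounted_return` over many episodes sampled from the same initial state gives a naive estimate of vπ(s0); this is exactly the Monte Carlo idea developed in the model-free part of the course.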
27/294 Optimality Definition 11 Let π be a policy and ε > 0. We say that π is ε-optimal in state s if vπ (s) ≥ v∗ (s) − ε. We say that π is optimal in s is it is 0-optimal in s, i.e. if vπ (s) = v∗ (s). A policy is (ε-)optimal if it is (ε-)optimal in every state. 28/294 Existence of optimal policies Theorem 12: (Classical result, not formally proven here) Let M be a finite MDP (i.e., the state and action sets are finite) with infinite-horizon returns. Then there exists an optimal MD policy. Moreover, an optimal MD policy can be computed in polynomial time. Agent control solved? NO! “Only” works if you can actually construct the MDP model of your environment and fit it into a computer. Otherwise, we use reinforcement learning. 29/294 Exact Planning with Known Model: Value & Policy Iteration Goal of this lecture Algorithms that compute the optimal value vector v∗ and some optimal MD policy π∗ given a full knowledge of an MDP M. 30/294 Polynomial-time algorithm MDPs can be solved by linear programming (LP) maximize ⃗c · ⃗x subject to A · ⃗x ≤ ⃗b • LP can be solved in polynomial time by so-called interior-point algorithms. • However, we typically use other, MDP-specific algorithms: value iteration (VI) and policy iteration (PI). These are not polynomial-time in general, but typically faster on practical instances. • Moreover, most truly RL algorithms can be seen as approximate generalizations of VI or PI (or both). 31/294 Example: Policy evaluation Exercise 13 Consider all four MD policies in our running “pub or study” example that always try to quit when in X state. Compute the values of these policies in the initial state start. 32/294 Policy evaluation equations Theorem 14 For any memoryless policy π and any state s it holds: vπ (s) = a∈A(s) π(a|s) · r(s, a) + γ · s′∈S p(s′ |s, a) · vπ (s′ ) def =qπ(s,a) . 33/294 Bellman optimality equations Theorem 15 The following holds for any state s: v∗ (s) = max a∈A(s) r(s, a) + γ · s′∈S p(s′ |s, a) · v∗ (s′ ) def =q∗(s,a) • Note: the policy evaluation equations are a special case of the Bellman ones: given a policy π, we can consider an MDP Mπ in which there is a single action ∗ enabled in each state and the probability of transition s ∗ → s′ equals a∈A(s) π(a|s) · p(s′ |s, a). Then Mπ mimics the behavior of π in M and Bellman eq’s in Mπ = evaluation equations for π in M. • But these equations are no longer linear! How do we solve them? Is the solution even unique? 34/294 Bellman update operator The right-hand-side (RHS) of the Bellman equations can be viewed as an operator Φ: RS → RS : for any ⃗x ∈ RS , Φ(⃗x) is a vector such that for any state s: Φ(⃗x)(s) def = max a∈A(s) r(s, a) + γ · s′∈S p(s′ |s, a) · ⃗x(s′ ) Exercise 16 In our running example, compute Φ(⃗0). Theorem 15 says that the optimal value vector v∗ is a fixed point of Φ: v∗ = Φ(v∗ ). 35/294 Mathematical hammers for Bellman Lemma 17: (not proven here) For any discount factor γ ∈ [0, 1), the Bellman operator Φ is a contraction, i.e. for any pair of vectors ⃗x, ⃗y it holds ∥Φ(⃗x) − Φ(⃗y)∥∞ ≤ γ · ∥⃗x − ⃗y∥∞. Theorem 18: Banach fixed point theorem (classical calculus, not proven here) A contraction mapping from a complete metric space (in particular, Rn ) to itself has a unique fixed point. 36/294 Exact characterization of v∗ Corollary 19 The optimal value vector is a unique solution of the Bellman optimality equations. In particular, also the policy evaluation equations have a unique solution, equal to vπ . 
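Corollary 19 can be exercised directly in code: for a fixed memoryless policy, the evaluation equations form a linear system, so vπ is obtained by a single direct solve. A minimal sketch, assuming a tabular model stored in NumPy arrays (P[s, a, s'] = p(s'|s, a), R[s, a] = r(s, a), Pi[s, a] = π(a|s); these array names are illustrative, not from the slides):

```python
import numpy as np

def evaluate_policy(P, R, Pi, gamma=0.9):
    """Solve the policy evaluation equations v = r_pi + gamma * P_pi v exactly,
    i.e. the linear system (I - gamma * P_pi) v = r_pi."""
    n = P.shape[0]
    P_pi = np.einsum("sa,sat->st", Pi, P)   # state-to-state transition matrix under pi
    r_pi = np.einsum("sa,sa->s", Pi, R)     # expected one-step reward under pi
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
```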
Since the policy evaluation equations are linear, their solution can be computed by Gaussian elimination. But the general Bellman equations are not linear. How can ve solve them? 37/294 Banach fixpoint theorem (full) Theorem 20: Banach fixed point theorem (full version, not proven here) A contraction mapping Φ from a complete metric space (in particular, Rn ) to itself has a unique fixed point ⃗z. Moreover, ⃗z is the limit of iterative applications of Φ on any initial vector. I.e., for any ⃗x0 ∈ Rn , the sequence ⃗x0, Φ(⃗x0), Φ(Φ(⃗x0)), Φ(3) (⃗x0), . . . converges to ⃗z: z = lim i→∞ Φ(i) (⃗x0) 38/294 Value iteration (VI; Bellman, 1957) Algorithm 1: Value iteration Input: MDP M = (S, A, p, r) Output: Approximation ˜v of v∗ x ← any vector from R|S| ; // typically ⃗0 next ← x; repeat foreach s ∈ S do next(s) ← max a∈A(s) [r(s, a) + γ · s′∈S p(s′ |s, a) · ⃗x(s′ )] Φ(x)(s) ; x ← next until termination condition; Typical term. conditions: • after a fixed no. of iterations (i.e., use for-loop with a fixed bound) • after each component of x changes less then some given ε 39/294 How to use VI By the Banach fixpoint theorem (and Lemma 17), the value of variable x VI converges to v∗ . Can we recognize when is x “close enough” to ⃗x? In the following couple of theorems, let ⃗x0, ⃗x1, ⃗x2, . . . be the sequence of vectors computed by VI, i.e. ⃗x0 is arbitrary and ⃗xi+1 = Φ(⃗xi ) for all i ≥ 0. Theorem 21: Stopping condition (not proven here) For any ε > 0: if ∥⃗xi+1 − ⃗xi ∥∞ ≤ ε · 1 − γ γ , then ∥⃗xi+1 − v∗ ∥∞ ≤ ε 40/294 How to use VI (2) How fast can we get to the point where we are close enough? Theorem 22: Speed of convergence (not proven here) For all i ≥ 0 it holds ∥⃗xn − v∗ ∥∞ ≤ γn 1 − γ · ∥⃗x1 − ⃗x0∥∞. In particular, if we terminate VI after i =     log(ε) + log 1−γ ∥⃗x1−⃗x0∥∞ log(γ)     steps, then its output xi will be an ε-approximation of v∗ . 41/294 How to use VI (3) Can we actually get some optimal values instead of approximations? First, note that VI computes optimal finite-horizon values: Let vi = supπ Eπ [ H i=1 γi−1 · Ri ]. The supremum is over all (i.e., history dependent) policies, since in the FH problem an optimal policy needs to track the number of elapsed (and thus remaining) steps: memory is needed for that. Theorem 23: (Easy but important exercise) If ⃗x0 = ⃗0, then ⃗xH = vH for all H ≥ 0. Moreover, let πH be a deterministic history-dependent policy such that for all 1 ≤ i ≤ H, whenever there are i steps remaining till the horizon, the policy πH selects in state s an action a s.t. a = arg max a∈A(s) [r(s, a) + γ · s′∈S p(s′ |s, a) · ⃗xi−1(s′ )] (with ties broken arbitrarily). Then πH is an optimal H-step policy. 42/294 Greedy policies Can we actually get optimal policy for the inf. horizon problem? Definition 24: ⃗x-greedy policy (very important) Let ⃗x ∈ RS be any vector. A ⃗x-greedy policy is an MD policy π such that in any state s: π(s) = arg max a∈A(s) [r(s, a) + γ · s′∈S p(s′ |s, a) · ⃗x(s′ )]. 43/294 How to use VI (4) Theorem 25: Optimal inf.-horizon policy from VI (not proven here) There is a number N polynomial in size of the MDP and exponential in the binary encoding size of γ such that a policy π that is ⃗xN -greedy is optimal in every state, i.e. vπ = v∗ . Note that once π is computed, vπ can be computed in polynomial time via policy evaluation equations. 
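The ingredients above (the Bellman operator, the stopping rule of Theorem 21, and greedy policy extraction as in Definition 24) assemble into a short VI routine; a sketch over the same hypothetical tabular arrays P[s, a, s'] and R[s, a] as in the evaluation sketch:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-8):
    """Iterate Phi until ||x_{i+1} - x_i||_inf <= eps * (1 - gamma) / gamma
    (Theorem 21), which guarantees ||x_{i+1} - v*||_inf <= eps."""
    v = np.zeros(P.shape[0])
    while True:
        q = R + gamma * P @ v      # q[s, a] = r(s, a) + gamma * sum_s' p(s'|s, a) * x(s')
        v_new = q.max(axis=1)      # Phi(x)
        if np.max(np.abs(v_new - v)) <= eps * (1 - gamma) / gamma:
            return v_new, q.argmax(axis=1)   # eps-accurate values and an x-greedy MD policy
        v = v_new
```

Ties in `argmax` are broken arbitrarily (here: lowest action index).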
Hence, VI can be said to solve MDPs in exponential time (and in polynomial time if the discount factor is assumed to be a fixed constant instead of an input parameter), though the approximate version is typically used in practice. Note: the fact that some policy π is ⃗x-greedy does not mean that vπ ≥ ⃗x! Homework: find a counterexample and post it to Discord. However, for VI it can be shown that if ∥⃗xi+1 − ⃗xi ∥∞ ≤ ε · 1−γ γ (stopping condition from Theorem 21), then an ⃗xi+1-greedy policy is ε-optimal. 44/294 Policy improvement start next passed sleep study: -2 1 study: -2 1 done: 20 1 done: 0 1 procrast.: 4 1 leave: -1 1 2 1 2 procrast.: -1 1 pub: +4 0.3 0.4 0.3 vπ = ( ) let π′ be vπ -greedy vπ′ = ( ) let π′′ be vπ′ -greedy 45/294 Policy improvement theorem Theorem 26: Policy improvement Let π be a policy. If Φ(vπ ) ≥ vπ , then any vπ -greedy policy πg is at least as good as π, i.e. ∀s ∈ S : vπg (s) ≥ vπ (s). Moreover, if Φ(vπ )(s) > vπ (s) for some state s, then also vπg (s′ ) > vπ (s′ ) for some state s′ . 46/294 Returns from a given time step For the proof of PIT and also many times later, we will need the following notation: Definition 27: Important! Let τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . be a trajectory and t ∈ N a time step. We define Gt(τ) = H−1 i=t γi−t · ri+1 = rt+1 + γrt+2 + γ2 rt+3 + · · · , where H ∈ N ∪ {∞} or H = T for episodic tasks. We similarly define, for any policy π: Gπ t = Eπ [Gt] = Eπ [ H−1 i=t γi−t · Ri+1]. 47/294 Proof of PIT (setup) We will define a sequence of policies π0, π1, π2, . . . s.t.: • π0 = π • πi behaves as πg (i.e., selects the same actions in same states) for the first i steps, then “switches” back to behave as π: • we also define π∞ = πg We want: vπ∞ (s) ≥ vπ (s) for all s. Not hard to see: vπi → vπ∞ as i → ∞ (πi behaves as π∞ for longer and longer as i increases + discounting). It suffices to show: vπi (s) ≥ vπ (s) for all i ∈ N and all s ∈ S. 48/294 Proof of PIT (induction) vπi (s) ≥ vπ (s) for all i ∈ N and all s ∈ S • i = 0: clear • i > 0: vπi (s) = Eπi [R1 + γR2 + · · · + γi−1 Ri + γi Ri+1 + · · · | S0 = s] = Eπi [R1 + γR2 + · · · + γi−2 Ri−1 | S0 = s] + Eπi [γi−1 Ri + γi Ri+1 · · · | S0 = s] = Eπi−1 [R1 + γR2 + · · · + γi−2 Ri−1 | S0 = s] + Eπi [γi−1 Ri + γi Ri+1 · · · | S0 = s] = Eπi−1 [R1 + γR2 + · · · + γi−2 Ri−1 | S0 = s] + Eπi [γi−1 Ri + γi Ri+1 · · · | S0 = s] Suppose we prove ≥Eπi−1 [γi−1 Ri +γi Ri+1···|S0=s] ≥ Eπi−1 [R1 + γR2 + · · · + γi−2 Ri−1 + γi−1 Ri + γi Ri+1 · · · | S0 = s] IH ≥ vπ (s) 49/294 Proof of PIT (induction, behavior at “reset”) We need: Eπi [γi−1 Ri + γi Ri+1 · · · | S0 = s] ≥ Eπi−1 [γi−1 Ri + γi Ri+1 · · · | S0 = s]. Eπi [γi−1 Ri + γi Ri+1 · · · | S0 = s] = γi−1 · Eπi [Ri + γRi+1 · · · | S0 = s] = γi−1 · s′∈S Pπi [Si−1 = s′ | S0 = s] · r(s′ , πi (s′ )) + γ · s′′ p(s′′ | s′ , πi (s′ )) · Eπi [Gi | Si = s′′ ] = γi−1 · s′∈S Pπg [Si−1 = s′ | S0 = s] · r(s′ , πg (s′ )) + γ · s′′ p(s′′ | s′ , πg (s′ )) · Eπ [Gi | Si = s′′ ] = γi−1 · s′∈S Pπg [Si−1 = s′ | S0 = s] · r(s′ , πg (s′ )) + γ · s′′ p(s′′ | s′ , πg (s′ )) · Eπ [Gi | Si = s′′ ] vπ(s′′) = γi−1 · s′∈S Pπg [Si−1 = s′ | S0 = s] · r(s′ , πg (s′ )) + γ · s′′ p(s′′ | s′ , πg (s′ )) · Eπ [Gi | Si = s′′ ] vπ(s′′) =Φ(vπ), since πg is vπ-greedy = γi−1 · s′∈S Pπg [Si−1 = s′ | S0 = s] · r(s′ , πg (s′ )) + γ · s′′ p(s′′ | s′ , πg (s′ )) · Eπ [Gi | Si = s′′ ] vπ(s′′) 50/294 Policy iteration (PI; Howard, 1960) Algorithm 2: Policy iteration Input: MDP M = (S, A, p, r) Output: Optimal MD policy π∗ for M, its value vector v∗ π ← arbitrary MD policy ; v ← vπ ; // e.g. 
by solving linear policy evaluation equations while Φ(v) ̸= v do foreach s ∈ S do π(s) ← arg maxa∈A(s)[r(s, a) + γ · s′∈S p(s′ |s, a) · v(s′ )] v ← vπ return π, v 51/294 PI correctness & complexity Theorem 28 Policy iteration terminates after at most exponentially many iterations. Upon termination, it returns an optimal MD policy. Proof: • Optimal upon termination: vπ is a fixpoint of Φ when terminating: optimality follows from Corollary 19. • Terminates: π always stores an MD policy and there are finitely many of these. We will show that no single MD policy appears in more than one iteration of PI. Consider any iteration and let v, v′ be the contents of variable v before and after the iteration. We will show that unless Φ(v) = v, it holds v′ > v, i.e. v′ ≥ v componentwise with strict inequality in some component. Hence, v = vπ strictly increases during PI, so no π can appear twice. 52/294 PI: correctness proof v′ ≥ v: We verify assumptions of PIT: Φ(v) ≥ v. Recall v = vπ . For all s ∈ S: Φ(v)(s) = max a∈A(s) [r(s, a) + γ · s′∈S p(s′ | s, a) · v(s′ )] ≥ r(s, π(s)) + γ · s′∈S p(s′ | s, π(s)) · vπ (s′ ) = vπ (s) = v(s). By PIT, v′ = vπ′ ≥ vπ = v (here π′ is the v-greedy policy). 53/294 PI: correctness proof II It remains to prove that v′ > v or PI terminates. Assume that v′ = v. Then for all s ∈ S: v′ (s) = r(s, π′ (s)) + γ · s′∈S p(s′ | s, π′ (s)) · vπ′ (s′ ) = r(s, π′ (s)) + γ · s′∈S p(s′ | s, π′ (s)) · vπ (s′ ) (assumption) = max a∈A(s) [r(s, a) + γ · s′∈S p(s′ | s, a) · vπ (s′ )] (π′ is v = vπ -greedy) = max a∈A(s) [r(s, a) + γ · s′∈S p(s′ | s, a) · vπ′ (s′ )] (assumption) = Φ(v′ )(s), so PI terminates at this point. Complexity? 54/294 PI&VI: discussion • We know that MDPs have a linear programming (LP) formulation. PI is basically a variant of a simplex method for this LP, using a special pivoting rule. • PI typicaly requires less iterations to converge than VI, though each iteration is more expensive (policy eval.) • Both PI and VI typically work well in practice for MDPs whose explicit transition table fits inside a computer. Which of the two is faster is rather domain-specific. 55/294 PI variants Can we get rid of the expensive policy evaluation by linear system solving? Yes: we can approximate the value of the current policy π by applying VI on the MDP Mπ , for either fixed number of steps or until v does not change much. Often appearing in RL textbooks: 56/294 Policy iteration with approximate evaluation π ← arbitrary MD policy; v ← arbitrary vector; repeat v ← Eval(π, v); foreach s ∈ S do π(s) ← arg maxa∈A(s)[r(s, a) + γ · s′∈S p(s′ |s, a) · v(s′ )] until π has not changed; return π, v Function Eval(π, v): v′ ← v; repeat foreach s ∈ S do v(s) ← v′ (s); v′ (s) ← r(s, π(s)) + γ · s′∈S p(s′ | s, π′ (s)) · v(s′ ) until ∥v − v′ ∥∞ ≤ ε; return v′ 57/294 Convergence of PI variants • The algorithm on previous slide still converges to an optimal policy provided that ε is small enough. • If we replaced the “π not changed condition” with the original “Φ(v) = v” condition, the algorithm might not terminate, since the VI is only guaranteed to reach a true fixpoint in the limit. However, v would still converge to v∗ and thus π would eventually become equal to an optimal policy. • The previous point holds even in the very degenerate case when we do just one iteration of VI per policy evaluation! See next slide. 
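As a reference point for these approximate variants, exact PI (Algorithm 2) is short when written against the hypothetical tabular arrays P[s, a, s'] and R[s, a] used in the earlier sketches, with each evaluation done by a linear solve:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Exact PI: evaluate the current MD policy by solving the linear evaluation
    equations, then make it v-greedy; stop once Phi(v) = v."""
    n_s = R.shape[0]
    pi = np.zeros(n_s, dtype=int)                # arbitrary MD policy
    while True:
        P_pi = P[np.arange(n_s), pi]             # rows p(. | s, pi(s)), shape (S, S)
        r_pi = R[np.arange(n_s), pi]
        v = np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)   # v = v_pi
        q = R + gamma * P @ v
        if np.allclose(q.max(axis=1), v):        # Phi(v) = v  =>  pi is optimal
            return pi, v
        pi = q.argmax(axis=1)                    # improvement: switch to a v_pi-greedy policy
```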
58/294 Curiously looking approximate PI v ← arbitrary vector; π ← v-greedy MD policy ; repeat foreach s ∈ S do v′ (s) ← r(s, π(s)) + γ · s′∈S p(s′ |s, π(s)) · v(s′ ); v ← v′ ; foreach s ∈ S do π(s) ← arg maxa∈A(s)[r(s, a) + γ · s′∈S p(s′ |s, a) · v(s′ )] until Φ(v) = v; return π, v This is just VI in disguise! 59/294 Generalized policy iteration Evaluate Improve push v towards vπ make π (more) v-greedy Source: Sutton&Barto, p. 87 60/294 Tabular Methods for Model-Free Reinforcement Learning Model-free We will still be working with MDPs. But for a bunch of the following lectures, we will not (necessarily) have access to, e.g.: • a table containing explicit enumeration of all states/actions • a table containing the description of p or r • the ability to compute the probability vector δ(s, a) or the reward signal r(s, a) given s and a (having this = gray-box model of the MDP) 61/294 Sampling from MDP But the MDP is still there “behind the scene”. In particular, we: • know how the states of the MDP look like • (e.g. robot state = all possible output values of its sensors) • know how the actions of the MDP look like • (e.g. robot = all possible signals that can be sent to the actuators) • can, for any s ∈ S, enumerate A(s) • could be weakened, but simplifies things • given s ∈ S and a ∈ A(s), we can sample the next state s′ ∼ p(s, a) and receive the reward r(s, a). 62/294 Sampling from a policy Given an effective representation of a policy π, we can sample a trajectory s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . by performing, for each t ∈ {0, . . . , T}: • sample at ∼ π(st) • query the environment for st+1 ∼ p(st, at) and rt+1 = r(st, at) • increment t Tabular = value estimates and policies represented as tables (e.g. Q(s, a) for each state s and action a used in s – explicit representation might only be needed for states/actions actually encountered). 63/294 Basic classification of (tabular) RL algorithms Three independent axes: problem on/off updates policy evaluation (value prediction) on-policy Monte Carlo vs. vs.       control off-policy temporal difference 64/294 Assumptions: successor-dependent rewards & episodic tasks Since we do no longer have the knowledge of the transition dynamics p, we cannot freely interchange MDPs with rewards functions of type S × A × S → R and S × A → R via the equation r(s, a) = s′∈S p(s′ | s, a) · r(s, a, s′ ). Hence, to maintain generality (and correspondence to e.g. Gymnasium environments) we will assume reward functions of type S × A × S → R. We will assume episodic returns: each trajectory terminates with probability 1 at some (possibly random) time step T. Termination can be defined e.g. by reaching some terminal state or by running out of some fixed decision horizon (in Gymnasium, this is sometimes called truncation): G = T−1 i=0 γi · Ri+1. Episode = one high-level iteration of an RL algorithm, corresponding of sampling a single trajectory from some policy. 65/294 Monte Carlo Methods MC evaluation Policy evaluation: given an effective representation of a policy π, estimate vπ (or qπ ). Naive Monte Carlo: Sample from π: if {τ1, τ2, . . . , τn} are trajectories (episodes) independently sampled under π from the same initial state s, then 1 n n i=1 G(τi ) → vπ (s) as n → ∞ due to law of large numbers (LLN). But this throws away a lot of valuable information! E.g. what if we want to estimate the whole vπ ? 
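The sampling interface described above maps directly onto the Gymnasium loop mentioned earlier; a minimal sketch (the FrozenLake-v1 environment and the uniformly random policy are placeholders), with truncation ending the episode:

```python
import gymnasium as gym

def rollout(env, policy, gamma=0.99):
    """Sample one episode from a policy; return the trajectory and its discounted return G."""
    s, _ = env.reset()
    episode, G, discount = [], 0.0, 1.0
    while True:
        a = policy(s)                                       # a_t ~ pi(s_t)
        s_next, r, terminated, truncated, _ = env.step(a)   # s_{t+1} ~ p(s_t, a_t), r_{t+1}
        episode.append((s, a, r))
        G += discount * r
        discount *= gamma
        if terminated or truncated:                         # episode ends at time T
            return episode, G
        s = s_next

env = gym.make("FrozenLake-v1")
episode, G = rollout(env, lambda s: env.action_space.sample())
```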
66/294 First-visit MC For each s, we estimate vπ (s) as an average of sample returns Ret(s) which is formed as follows: • initially, Ret(s) = ∅ for all s • we then sample trajectories until timeout: • for each sampled trajectory τ and each state s, we identify the first occurrence of s on τ: let this be at timestep t; we add Gt (τ) to Ret(s) start X start next pass. sleep 4 -1 -2 -2 20 Sub-trajectory starting at the first appearance of s can be seen as a trajectory sampled from π when s is the initial state! (Since we consider memoryless π.) Theorem 29 As |Ret(s)| → ∞, the average of Ret(s) converges to vπ (s). Moreover, the average of Ret(s) is an unbiased estimate of vπ (s) (as long as Ret(s) ̸= ∅). 67/294 First-visit MC (pseudocode) Source: Sutton&Barto, p.92 68/294 Every-visit MC For each s, we estimate vπ (s) as an average of sample returns Ret(s) which is formed as follows: • initially, Ret(s) = ∅ for all s • we then sample trajectories until timeout: • for each sampled trajectory τ and each state s, and each t such that St (τ) = s we add Gt (τ) to Ret(s) start X start next pass. sleep 2 -1 -2 -2 20 The sample returns added to Ret(s) within the same episode are not independent! Hence, the estimate is biased, though the bias vanishes in the limit: Theorem 30 As |Ret(s)| → ∞, the average of Ret(s) converges to vπ (s). 69/294 MC convergence Optional reading: More on MC estimate bias, variance, and convergence in: Singh, S.P. and Sutton, R.S.: Reinforcement Learning with Replacing Eligibility Traces. In Machine Learning 22:123–158. Kluwer, 1996. (Section 3, particularly 3.3 and onwards, you can skip Theorem 4.) 70/294 Towards MC control Control = computation of “good” policy for a given environment. (Ideally, the policy should get closer to the optimal policy the more episodes we sample.) We know (PIT): given a policy π a vπ -greedy policy is at least as good as π: πg (s) = arg max a∈A(s) s′∈S p(s′ | s, a) · r(s, a, s′ ) + γ · vπ (s′ ) Do we have an algo? There is an issue: 71/294 MC control with q-values Recall: qπ (s, a) def = s′∈S p(s′ | s, a) · r(s, a, s′ ) + γ · vπ (s′ ). Thus, the vπ -greedy policy πg can be defined as: πg (s) = arg max a∈A(s) qπ (s, a) Estimate by MC. 72/294 MC for q-value estimation Analogous to value estimation, e.g. first-visit: For each s, a, we estimate qπ (s, a) as an average of sample returns Ret(s, a) which is formed as follows: • initially, Ret(s, a) = ∅ for all s • we then sample trajectories until timeout: • for each sampled trajectory τ and each state-action pair (s, a), we identify the first t such that St (τ) = s ∧ At (τ) = a; we add Gt (τ) to Ret(s) start X start next pass. sleep proc.:4 study:-1 study:-2 done:-2 done:20 Similarly for every visit. Convergence guarantees the same as for state values. 73/294 Infinite exploration and exploring starts Issue: MC only estimates qπ (s, a) if: • s guaranteed to be visited with positive probability in each episode • π(a | s) > 0. Definition 31: Infinite exploration A RL algorithm has infinite exploration (IE) if, during the infinite execution of the algorithm, each state-action pair (s, a) is visited infinitely often with probability 1. One way of achieving IE is through exploring starts (ES): each episode begins with (typically uniformly) randomly selected s0 and a0. This is achievable when training, e.g., in simulated environments but might be difficult/impossible in real-world environments. 
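First-visit MC evaluation as code (the every-visit variant differs only in the commented line); a sketch assuming `sample_episode()` returns a list of (s, a, r) triples as in the rollout sketch above:

```python
from collections import defaultdict

def first_visit_mc(sample_episode, gamma=0.99, n_episodes=10_000):
    """Estimate v_pi(s) as the average of returns observed from the first
    occurrence of s in each sampled episode."""
    returns = defaultdict(list)                        # Ret(s)
    for _ in range(n_episodes):
        episode = sample_episode()
        G, G_t = 0.0, [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):      # G_t = r_{t+1} + gamma * G_{t+1}
            G = episode[t][2] + gamma * G
            G_t[t] = G
        first = {}
        for t, (s, _, _) in enumerate(episode):
            first.setdefault(s, t)                     # remember first occurrence only
        for s, t in first.items():
            returns[s].append(G_t[t])
        # every-visit MC: instead append G_t[t] for *every* t with S_t = s
    return {s: sum(g) / len(g) for s, g in returns.items()}
```

The q-value version is identical except that returns are keyed by the pair (s, a).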
74/294 MC control with exploring starts Source: Sutton&Barto, p.99 75/294 IE through ε-soft policies Exploring starts are not always feasible. Alternative: make the sampled policy itself exploratory. Definition 32: ε-soft policy A policy π is ε-soft if for every s ∈ S and every a ∈ A(s) it holds π(a|s) ≥ ε |A(s)| . Definition 33: ε-greedy policy Let v ∈ RS be a value vector. A policy π is v-ε-greedy if for every state s ∈ S there is action a∗ = arg maxa∈A(s) s′∈S p(s′ | s, a) · r(s, a, s′ ) + γ · v(s′ ) such that for any action a ∈ A(s) it holds: π(a|s) =    ε A(s) if a ̸= a∗ 1 − ε + ε A(s) if a = a∗ . Interpretation: with prob. ε: play uniformly at random; with prob. 1 − ε: play greedily. 76/294 ε-softing Definition 34 Let π be a policy. An ε-softing of π is a policy πε defined as follows: in each state s • with probability ε, πε selects an action uniformly at random; • with probability 1 − ε, πε selects a ∼ π(s). I.e., an ε-greedy policy can be alternatively defined as ε-softing of a greedy policy. 77/294 MC control with ε-greedy policies source: Sutton&Barto, p. 101 78/294 Policy iteration for ε-soft policies Theorem 35 Let π be an ε-soft policy and let π′ be a vπ -ε-greedy policy. Than vπ′ ≥ vπ (componentwise). Moreover, the two value vectors are equal if and only if bot π and π′ are optimal among all ε-soft policies; i.e. if, for every state s: vπ (s) = sup ¯π that is ε-soft v ¯π (s). Proof: Required reading: Sutton&Barto, p.101-103. 79/294 Incremental computing of averages Given a sample {n1, n2, . . . , nk+1} and average A = avg({n1, n2, . . . , nk }), how to compute A′ = avg({n1, n2, . . . , nk , nk+1}) without recomputing the average of the whole sample? A′ = k k + 1 · A + nk+1 k + 1 . 80/294 On-policy vs. off-policy • On-policy algorithms: track one “policy variable” π; the policy stored in π is used to interact with the environment (i.e., to sample episodes) and at the same time we learn something about it (e.g. its value vector). • Corresponds to the generalized policy iteration scheme. • All the MC algos we have seen so far. • Off-policy algorithms: track more (typically two) different policy variables: • behavior policy: used to sample episodes • target policy: which we want to learn about 81/294 Off-policy evaluation We are given effective representations of: • a behavior policy β, • a target policy π. The task is to estimate vπ by sampling episodes from β . We cannot sample from π! (E.g. π too risky or expensive to sample from.) Assumptions: • given (s, a), we can effectively compute π(a|s) and β(a|s) (or at least estimate via sampling) • coverage: ∀s ∈ S, a ∈ A(s): if π(a|s) > 0, then also β(a|s) > 0 82/294 Importance sampling Definition 36: Importance ratio Let τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . be a trajectory. The importance-sampling ratio of τ is the quantity ρ(τ) def = Pπ [τ | S0 = s0] Pβ[τ | S0 = s0] = Pπ [A0 = a0, S1 = s1, A1 = a1, . . . , AT−1 = aT−1, ST = sT | S0 = s0] Pβ[A0 = a0, S1 = s1, A1 = a1, . . . , AT−1 = aT−1, ST = sT | S0 = s0] . ρ(τ) can be computed without the knowledge of MDP transition probabilities! ρ(τ) = π(a0 | s0) · p(s1 | s0, a0) · π(a1 | s1) · p(s2 | s1, a1) · · · β(a0 | s0) · p(s1 | s0, a0) · β(a1 | s1) · p(s2 | s1, a1) · · · 83/294 Importance ratio from time t Definition 37 Let τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . be a trajectory. By τi..j we denote the subtrajectory of τ starting in time step i and ending in timestep j. By τi.. we denote the suffix of si , ai , ri+1, si+1, ai+1, . . .. 
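Definition 36 in code: because the transition probabilities cancel, ρ can be computed from the trajectory and the two policies alone. In this sketch, `pi_prob` and `beta_prob` are hypothetical callables returning π(a|s) and β(a|s), and the `t` argument anticipates the time-shifted ratio defined next:

```python
def importance_ratio(episode, pi_prob, beta_prob, t=0):
    """rho_t(tau) = prod_{i >= t} pi(a_i | s_i) / beta(a_i | s_i);
    the factors p(s' | s, a) cancel, so no model knowledge is needed."""
    rho = 1.0
    for s, a, _ in episode[t:]:
        rho *= pi_prob(a, s) / beta_prob(a, s)   # coverage guarantees beta(a|s) > 0
    return rho
```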
Definition 38: Importance ratio from time t Let τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . be a trajectory and t a time step. The importance-sampling ratio of τ from t is the quantity ρt(τ) def = Pπ [τt.. | S0 = st] Pβ[τt.. | S0 = st] = Pπ [A0 = at, S1 = st+1, A1 = at+1, . . . , AT−1−t = aT−1, ST−t = sT | S0 = st] Pβ[A0 = at, S1 = st+1, A1 = at+1, . . . , AT−1−t = aT−1, ST−t = sT | S0 = st] . 84/294 Off-policy evaluation with importance sampling Theorem 39 For any s ∈ S it holds: Eβ [ρ · G | S0 = s] = vπ (s). Proof: Eβ [ρ · G | S0 = s] = τ Pβ [τ | S0 = s] · ρ(τ) · G(τ) = τ Pβ [τ | S0 = s] · Pπ [τ | S0 = s] Pβ[τ | S0 = s] · G(τ) = τ Pπ [τ | S0 = s] · G(τ) = Eπ [G | S0 = s] = vπ (s). Easily integrates into both first-visit and every visit MC: sample from β and store ρt(τ) · Gt(τ) in Ret(st). 85/294 Weighted importance sampling First-visit variant: for each state s, we keep a set of samples Sam(s). Each sample is a tuple (τ, t) – trajectory and time step. • initially, Sam(s) = ∅ for all s • we then sample trajectories until timeout: • for each sampled trajectory τ and each state s, and the smallest t such that St (τ) = s we add (τ, t) to Sam(s) Throughout the algorithm, the value of state s is estimated as WIS(s) = (τ,t)∈Sam(s) ρt(τ) · Gt(τ) (τ,t)∈Sam(s) ρt(τ) Exercise 40 Compare ordinary/weighted importance sampling after single sample. 86/294 Weighted importance sampling – correctness The weighted sampling is clearly a biased estimator. However, the bias vanishes in the limit: Theorem 41 With probability 1: as |Sam(s)| → ∞, we have that WIS(s) → vπ (s). Proof: 87/294 Ordinary vs. weighted sampling source: Sutton&Barto, p. 106 88/294 Importance sampling: summary But ordinary and weighted importance sampling can be adapted to every-visit MC. Bias & Convergence: • First visit: • ordinary IS: unbiased, i.e. also converges • weighted IS: biased, but converges in the limit • Every visit: • both ordinary and weighted: biased (due to EV), but converges in the limit 89/294 Weighted IS: incremental implementation Instead of recomputing the weighted average for each new sample, WIS(s) can be updated by keeping keep just two variables: • V – current value of WIS(s), initially arbitrary • C – the sum of importance ratios, initially 0 Upon arrival of new sample (τ′ , t′ ), we update V , C into new values V ′ , C′ by setting: C′ = C + ρt′ (τ′ ) V ′ = V + ρt′ (τ′ ) C′ · (Gt′ (τ′ ) − V ) . 90/294 Off-policy evaluation with weighted IS 91/294 Off-policy control with weighted IS Required reading: Sutton&Barto, Section 5.7. 92/294 Temporal Difference Methods TD: Motivation Let us first focus on policy evaluation. MC: zero bias (at least in the limit), but potentially high variance: many samples needed to converge. Also, to update estimates, it must wait till the end of each episode. TD methods retain the focus on sampling but combine it with bootstrapping. 93/294 Incremental update notation Definition 42: Notation for updates In the context of RL algorithms will denote by V n (s) (resp. Qn (s, a)) the algorithm’s estimate of vπ (s) (resp. qπ (s, a)) after n-th update of this estimate. 94/294 MC vs. TD(0) update On-policy MC (incremental) update using sampled trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . .: V n+1 (st) ← (1 − αn)V n (st) + αnGt(τ) = V n (st) + αn · Gt(τ) update target −V n (st) update error , where αn = n/(n + 1). 
TD(0) update in the same situation, with αn “suitably chosen” (possibly constant): V n+1 (st) ← V n (st) + αn · Rt+1(τ) + γ · V n (St+1(τ)) bootstrap −V n (st) 95/294 Policy evaluation with TD(0) source: Sutton&Barto, p. 120 96/294 How can it even work? Really “just” a very asynchronous, sample-based, and “α-dampened” version of value iteration. Eπ [Gt|St = s] = Eπ [Rt+1 + γ · Gt+1 | St = s] = Eπ [Rt+1 | St = s] + γ · Eπ [Gt+1 | St = s] vπ(St+1) . In expectation, the TD(0) update is the same as VI update in Mπ . Thanks to the contractivity of the Bellman operator, VI possesses an error reduction property: after each update, the error of the estimate decreases. Hence, in expectation, the same is true for the TD(0) update. Formal proof of correctness in optional reading: Sutton, R.S.: Learning to Predict by Methods of Temporal Differences. In Machine Learning 3:9–44. Kluwer, 1988. (For MDPs with function approximation.) 97/294 Why TD is natural (Sutton&Barto, p. 122-123) Left: MC. Right TD(0). 98/294 On-policy TD control Recall: • In control setting, we need to estimate q-values of a policy. • On-policy: we sample trajectories according to some policy π and then push value estimates towards qπ . To maintain exploration, the policy π will typically be the ε-Q-greedy policy for some ε > 0, where Q are the current Q values estimates. I.e., throughout the algorithm π(a|s) =    1 − ε + ε A(s) if a = arg maxa′∈A(s) Q(s, a′ ) ε A(s) otherwise. 99/294 SARSA State-Action-Reward-State-Action. Introduced in Rummery, Niranjan: On-Line Q-Learning Using Connectionist Systems (1994). In each episode, sample a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . , sT according to current policy π; for each time step 0 ≤ t ≤ T − 1, perform the following update: Qn+1 (st, at) = Qn (st, at) + αn · rt+1 + γQn (st+1, at+1) − Qn (st, at) The update can be performed immediately when st+1 and at+1 is known (no need to wait for the episode to terminate). After the episode ends, make π ε-Q-greedy. Conforms to the GVI scheme. 100/294 SARSA pseudocode source: Sutton&Barto, p. 130 101/294 GLIE policies Definition 43: GLIE condition A RL algorithm is greedy in the limit (GL) if its behavior policy (=target policy in onpolicy algorithms) converges to a 0-greedy policy with increasing number of episodes. A RL algorithm is GLIE if it is GL and IE (infinitely exploring). Typical ways of ensuring GLIE: • Dynamically adjust ε in ε-greedy policy selection” When selecting action in state s, behave ε-greedily with ε = c n(s) , where 0 < c < 1 is a constant and n(s) is a number of visits to state s over all the episodes so far. • Use Boltzmann (softmax) exploration: π(a | s) = e Q(s,a) η(s) b∈A(s) e Q(s,b) η(s) , where η is a state-dependent and time-varying temperature parameter. We need η to converge to 0 over time, but not too fast (often, η(s) proportional to 1 log(n(s)) ). 102/294 Convergence of SARSA Theorem 44 Consider a GLIE instantiation of SARSA. Moreover, assume that the sequence of learning rates (αn)n∈N satisfies n αn = ∞ and n α2 n < ∞. In this setting, Q converges to q∗ and the behavior policy of SARSA converges to some optimal policy π∗ . For the proof, see optional reading: Singh, Jaakkola, Littman, Szepesv´ari: Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms. In Machine Learning 39:287-308. Kluwer, 2000. Note: learning rate can itself be state/action dependent (omitted for conciseness, constant learning rates preferred in practice). 
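Tabular SARSA in the Gymnasium-style interface assumed earlier, with a constant step size and an ε-greedy behavior policy; zeroing the bootstrap on termination is an implementation detail, not something the slides spell out:

```python
import numpy as np
from collections import defaultdict

def sarsa(env, n_episodes=5000, alpha=0.1, gamma=0.99, eps=0.1):
    """On-policy TD control: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a))."""
    Q = defaultdict(lambda: np.zeros(env.action_space.n))

    def eps_greedy(s):
        if np.random.rand() < eps:
            return env.action_space.sample()          # explore
        return int(np.argmax(Q[s]))                   # exploit (ties -> lowest index)

    for _ in range(n_episodes):
        s, _ = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            a2 = eps_greedy(s2)
            target = r + (0.0 if terminated else gamma * Q[s2][a2])
            Q[s][a] += alpha * (target - Q[s][a])     # SARSA update
            s, a = s2, a2
    return Q
```

Replacing `gamma * Q[s2][a2]` in the target with `gamma * np.max(Q[s2])` turns this into tabular Q-learning, the off-policy method introduced next.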
103/294 Off-policy TD control Surprise: no importance sampling! Recall the SARSA update: Qn+1 (st, at) = Qn (st, at) + αn · rt+1 + γQn (st+1, at+1) − Qn (st, at) It pushes Q towards qπ , where π is the current policy. Idea: push Q directly towards q∗ . We could do this e.g. by a VI-like update: Qn+1 (st, at) = Qn (st, at) + αn · max a∈A(s) s′∈S p(s′ | s, a) r(s, a, s′ ) + γV (s′ ) − Qn (st, at) . Two problems: • We do not calculate v-estimates. (Must somehow replace with Q) • We must get rid of transition probabilities and instead use the sampled at and rt+1. 104/294 Q-learning update Solution: push the max towards the bootstrap. Q-learning (Watkins, 1989): given a sampled trajectory s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . ., for every t we update: Qn+1 (st, at) = Qn (st, at) + αn · rt+1 + γ · max a∈A(st+1) Qn (st+1, a) − Qn (st, at) 105/294 Q-learning pseudocode source: Sutton&Bato, p. 131 Off-policy: where is the second policy? 106/294 Q-learning convergence Theorem 45 Consider any Q-learning instantiation with infinite exploration. Assume that the sequence of learning rates (αn)n∈N satisfies n αn = ∞ and n α2 n < ∞. In this setting, Q converges to q∗ . Moreover, if the behavior policy is GL, then it converges to an optimal policy π∗ . Proof in optional reading: Watkins, Dayan: Q-Learning. In Machine Learning 8:279-292. Kluwer, 1992. 107/294 SARSA vs. Q-learning (SB: p. 132) Left: greedy policies learned by SARSA and Q-learning. Right: in-training performance with a 0.1-greedy behavior policy. (Rough) takeaway: Q-learning more aggressive in finding optimal policy, can lead to risky behavior. Possibly advantageous when environment not too stochastic or if in-training performance has less importance (simulator vs. real world). 108/294 Maximization bias in Q-learning Q-learning is “risky” not only due to exploration, but also because it is optimistic in the face of uncertainty. TBD The positive bias only disappears in the limit. 109/294 Double Q-learning Idea: use two independent value estimates Q1, Q2 and decouple action selection from evaluation in the bootstrap. During each update, we randomly select one of these for update, which is also used to select the maximizing action in bootstrap. The other is used as the bootstrap estimate. I.e., in each time step t we perform one of these updates, each with probability 1 2 : either Q1(st, at) = Q1(st, at) + αn · rt+1 + γ · Q2 st+1, (arg max a∈A(st+1) Q1(st+1, a)) − Q1(st, at) or Q2(st, at) = Q2(st, at) + αn · rt+1 + γ · Q1 st+1, (arg max a∈A(st+1) Q2(st+1, a)) − Q2(st, at) . Behavior policy = e.g. ε-greedy w.r.t. Q1 + Q2. 110/294 Double Q-learning pseudocode source: Sutton&Barto, p. 136 111/294 Why Double Q-learning helps 112/294 Double Q-learning: experiment source: Sutton&Barto, p. 135 113/294 Between Monte Carlo and TD: n-Step and λ-Returns MC vs TD(0) update targets Given a trajectory s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . .: s0 a0 s1 a1 s2 a2 s3 a3 s4 a4 . . . r1 r2 r3 r4 r5 Update for time step t in: • MC = discounted return from t till the end of trajectory, e.g. for t = 1: r2 + γr3 + γ2 r4 + · · · + γT−2 rT unbiased, but high variance + need the whole trajectory • TD(0) = 1-step reward and then (discounted) bootstrap: r2 + γV (s2) 114/294 n-step return Idea: use n-step discounted return and then bootstrap Definition 46 Let τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . be a trajectory and n ∈ N \ {0}. 
An n-step return of τ from time step t is the quantity Gt:t+n(τ) = rt+1 + γ · rt+2 + γ2 · rt+3 + · · · γn−1 rt+n + γn · V (st+n). We also define a Q-estimate-based version: Gt:t+n(τ) = rt+1 + γ · rt+2 + γ2 · rt+3 + · · · γn−1 rt+n + γn · Q(st+n, at+n). (Which of the two is used will be clear from the context.) s0 a0 s1 a1 s2 a2 s3 a3 s4 a4 . . . r1 r2 r3 r4 r5 115/294 n-step TD policy evaluation Similar to TD(0), but using n-step return targets: given a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . ., for each 0 ≤ t < T we perform an update V (st) ← V (st) + α[Gt:t+n(τ) − V (st)]. Note 1: for n = 1 we get exactly TD(0). Note 2: for n > 1, we cannot update V (st) directly at step t + 1. We need to obtain rt+1, . . . , rt+n, st+n first, i.e. we can perform the update after step t + n. Note 3: if t + n > T, we truncate the sum in Gt:t+n at rT , i.e. in such a case Gt:t+n = Gt. 116/294 n-step TD policy evaluation: pseudocode source: Sutton&Barto, p.144 117/294 n-step TD policy evaluation: performance 19-state symmetric random walk: source: Sutton&Barto, p.145 118/294 n-step SARSA (on-policy control) Uses Q-value-bootstrapped n-step returns. For a sampled trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . and for all time steps 0 ≤ t < T we perform an update Q(st, at) ← Q(st, at) + α[Gt:t+n − Q(st, at)]. We sample trajectories according to a policy π that is ε-greedy w.r.t. current Q-estimates: π(a|s) =    1 − ε + ε |A(s)| if a = arg maxa′∈A(s) Q(s, a′ ) (ties broken in principled way) ε |A(s)| otherwise. π is redefined in this way after each episode 119/294 n-step SARSA (pseudocode) source: Sutton&Barto, p. 147 120/294 n-step SARSA (speed of signal propagation) source: Sutton&Barto, p. 147 121/294 n-step Q-learning The Q-learning update for a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . and time step t: Q(st, at) ← Q(st, at) + α · [rt+1 + γ · max a∈A(st+1) Q(st+1, a) − Q(st, at)]. Naive extension to n-step returns Q(st, at) ← Q(st, at)+α·[rt+1+γrt+2+· · ·+γt+n−1 ·rt+n+γt+n · max a∈A(st+1) Q(st+n+1, a)−Q(st, at)] does not really correspond to Q-learning, since some of the actions at+1, . . . , at+n might not be Q-greedy (the behavior policy is ε-greedy, so some actions might be exploratory). Hence, we are no longer pushing Q towards the Q-value of an optimal policy. s0 a0 s1 a1 s2 a2 s3 a3 s4 a4 . . . r1 r2 r3 r4 r5 122/294 n-step Q-learning: correct Idea: apply the Q-learning bootstrap at the first occurrence of a non-Q-greedy action. I.e., for each episode: • make π an ε-Q-greedy policy • sample a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . from π • for each time step 0 ≤ t < T: • identify the smallest n′ ∈ {t + 1, t + 2, . . . , t + n} such that an′ is not a Q-greedy action • perform the update Q(st , at ) ← Q(st , at ) + α · [rt+1 + γrt+2 + · · · + γn′ −1 rn′ + γn′ · max a∈A(sn′ ) Q(sn′ , a) − Q(st , at )] (if n′ > T, do the standard MC update). s0 a0 s1 a1 s2 a2 s3 a3 s4 a4 . . . r1 r2 r3 r4 r5 123/294 λ-returns: idea By varying n, the n-step returns provide a nice tradeoff between bias and variance (and update speed). But the choice of optimal n is mostly a guesswork. Idea: find a notion of return which combines n-step returns for multiple n’s. E.g., a suitable convex combination of individual n-step returns. This leads to the notion of λ-returns. We will focus only on policy evaluation, though λ-returns can be used also in control. 124/294 λ-returns: definition Recall: Gt:t+n is the n-step return from timestep t. 
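As a quick code refresher before the λ-mixture: the (state-value-bootstrapped) n-step return of Definition 46, truncated when the episode ends early; `V` is a hypothetical mapping from states to current value estimates and `episode` a list of (s, a, r) triples:

```python
def n_step_return(episode, t, n, V, gamma=0.99):
    """G_{t:t+n} = r_{t+1} + gamma r_{t+2} + ... + gamma^{n-1} r_{t+n} + gamma^n V(s_{t+n});
    if t + n reaches the end of the episode, the sum is truncated and no bootstrap is added."""
    T = len(episode)
    G = 0.0
    for i in range(t, min(t + n, T)):
        G += gamma ** (i - t) * episode[i][2]      # discounted rewards r_{i+1}
    if t + n < T:
        G += gamma ** n * V[episode[t + n][0]]     # bootstrap on V(s_{t+n})
    return G
```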
Definition 47: λ-return Let λ ∈ [0, 1]. A λ-return from timestep t is the random variable Gλ t = (1 − λ) ∞ n=1 λn−1 Gt:t+n. Note that due to truncation at t + n ≥ T, the λ-return can be more explicitly written as Gλ t = (1 − λ) T−t−1 n=1 λn−1 · Gt:t+n + λT−t−1 · Gt. s0 a0 s1 a1 s2 a2 s3 a3 s4 r1 r2 r3 r4 125/294 λ-return as discounting of n-step returns source: Sutton&Barto, p. 290 126/294 (Forward-view) TD(λ) Like TD(0), but uses λ-returns. Given a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . sampled from the evaluated policy π, we perform, for each time step t an update: V (st) ← V (st) + α(Gλ t (τ) − V (st)). Note: • for λ = 0, this is exactly the TD(0) update, • for λ = 1, this is exactly the MC update, • Gλ t (τ) depends on the whole suffix of τt.., hence the update can be only performed at the end of the episode. (We will show a workaround later.) 127/294 TD(λ) vs. n-step TD on 19-state random walk source: Sutton&Barto, p. 291 128/294 Forward vs. backward view TD(λ) Backward-view TD(λ) = an algorithm performing roughly the same updates as Forward-view TD(λ) in an online fashion (V (st) can be updated by time t + 1). Implemented using eligibility traces: state-wise signals that indicate how much is the current state eligible for an update (sort of state-wise modulation of the learning rate). We are more keen to update states that: • appear often along the trajectory (frequency heuristic) • were visited in the recent past (recency heuristic) 129/294 Accumulating eligibility trace Definition 48: (Accumulating) eligibility trace For a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . ., λ ∈ [0, 1], and a state s ∈ S, an (accumulating) eligibility trace is a sequence of values E0(s), E1(s), E2(s), . . . defined inductively as follows: E0(s) = 0 and for t > 0 Et(s) = γ · λ · Et−1(s) + I(St(τ) = s), where I(St(τ) = s) is the indicator of the t-th state of τ being s, i.e. I(St(τ) = s) = 1 if st = s and I(St(τ) = s) = 0 otherwise. source: slides of D. Silver (Model-free prediction) 130/294 Backward-view TD(λ): idea • Et(s) denotes how much is s eligible for an update after playing t-th action along the run (i.e., action at−1). • In time step t, all states with non-zero eligibility signal will have their estimates updated in proportion to the learning rate and the strength of the eligibility signal. • The update target is the standard TD(0) target for time t. I.e., for each timestep t and each state q, we perform the update V (q) ← V (q) + α · Et(q) · rt+1 + γ · V (st+1) − V (st) . 131/294 Backward-view TD(λ): pseudocode Input: policy π to evaluate Output: Estimate V of vπ initialize V arbitrarily repeat s ← sample uniformly (ES) or according to init. distr. initialize E to be uniformly zero while s not terminal do a ← sample from π(s) s′ ← sample from p(s, a) r ← r(s, a, s′ ) foreach q ∈ S (Only q’s visited so far) do E(q) = γ · λ · E(q) + I(q = s) V (q) ← V (q) + α · E(q) · r + γ · V (s′ ) − V (s) s ← s′ until timeout 132/294 Forward vs. backward view If λ = 0, then Et(q) = I(St(τ) = q), i.e. the backward-view update at time point t is V (st) ← V (st) + α · r + γ · V (st+1) − V (st) , while for all states other than st, no update is performed. I.e., backward TD(0) is exactly the same thing as forward TD(0). For general λ the correspondence is more subtle: Theorem 49: Forward-backward view correspondence Assume that in the backward view, all the updates along the trajectory are performed offline, i.e. only after the end of the episode, and in a batch, i.e. 
concurrently, using the pre-episode estimates in right-hand sides. Then, for any λ ∈ (0, 1), this offline backward TD(λ) performs the same updates as forward TD(λ). 133/294 Offline backward and forward view (source: D. Silver slides) Given the batch nature of updates, it suffices to show that the forward update target at time t equals the sum of all updates “triggered” by a visit to state st. 134/294 Online vs. offline backward view on 19-state RW In practice, we want to use the online backward algorithm, which only approximates the forward view. Nevertheless, it performs acceptably: source: Sutton&Barto, p. 295, LEFT: online backward TD(λ), RIGHT: offline forward TD(λ) 135/294 Concluding remarks on λ-returns • There are other types of eligibility traces (replacing, dutch, . . . ), yielding different algorithms. • Eligibility traces neatly generalize to deep learning, where they are not state-wise but parameter-wise signals; optional reading: Sutton&Barto, Sec. 12.1-12.2 • λ-returns and eligibility traces can be generalized to control setting SARSA(λ), Q(λ); optional reading: Sutton&Barto, Sec. 12.7-12.10. • There is a true online backward TD(λ) version. Here, true online=having perfect equivalence with the forward view. However, the equivalence is w.r.t. a more complex notion of λ-return (truncated λ-return) and uses more complex version of eligibility traces than presented here. Outperforms both forward and backward algorithms presented here. Optional reading: van Seijen, Sutton: True Online TD(λ). In Proceedings of ICML’14. 136/294 First Steps Towards Deep RL: Value-Based On-Policy Methods Working with huge MDPs E.g. original Atari games have 160x192 resolution with 128 colors: observable state space of size 27·160·192 = 2215040 (though only a fraction reachable and resolution typically scaled down in benchmarks – however, state typically encompass last 3 frames so as to provide some info on movement). State space can be even continuous (position, velocity,. . . ). Most states will not be seen - we need the ability to generalize from experience to unseen/rarely seen states. 137/294 Huge MDP representation From now on, states of the MDP will be represented by vectors from Rn . The vectorized representation is chosen in a domain-specific manner, e.g.: • Atari = one component per pixel per frame • continuous navigation = agent coordinates, velocity, etc. • small discrete MDPs can be represented by one-hot encoding For simplicity, we will still assume that the action space is discrete, and reasonably small, though many algorithms can be adapted for continuous actions (acceleration, etc). 138/294 Function approximators The value functions have types: vπ , v∗ : Rn → R qπ , q∗ : Rn × A → R. In RL, we need to approximate these functions. Definition 50 A function approximator (FA) for functions of type X → Y is a class of functions f ⊆ Y X parameterized by a some set of parameter vectors Θ ⊆ Rn . Each concrete parameter vector θ ∈ Θ defines a concrete function fθ ∈ f , i.e. f = {fθ | θ ∈ Θ}. For FA f , we often write fθ(x) = f (x, θ) to stress the fact that the output of fθ depends on both the input x and on θ. Hence, FA for type X → y can be itself seen as a function of type X × Θ → Y . 
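Definition 50 instantiated: the linear approximator Vθ(s) = θ⊤s over Θ = R^n (one of the forms listed on the next slide); a minimal sketch, with the one-hot remark showing how the tabular case is recovered:

```python
import numpy as np

class LinearV:
    """Linear FA: V_theta(s) = theta^T s for states s in R^n, with Theta = R^n."""
    def __init__(self, n):
        self.theta = np.zeros(n)
    def value(self, s):
        return float(self.theta @ s)
    def grad(self, s):
        return s           # grad_theta of theta^T s is just the feature vector s

# With one-hot state features, theta holds one entry per state: the tabular case.
```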
139/294 Function approximators in RL Our algorithms will use mainly these types of function approximators: • V : Rn × Θ → R to approximate vπ or vθ • Q : (Rn × A) × Θ → R to approximate qπ or qθ The typical task is to find θ ∈ Θ such that Vθ = V (·, θ) is a “good” approximation for vπ or vθ , and similarly for Qθ. The parametrization Θ will depend on the concrete form of function approximator used. 140/294 Forms of function approximators • tabular • θ = the vector containing the contents of the table • linear • Θ = S = Rn and e.g. Vθ(s) = θ⊤ · s • neural nets • θ = NN weights and biases • decision trees • . . . We require the approximators to be differentiable and to admit a training method suitable for non-stationary data. 141/294 Neural nets (source: slides by T. Br´azdil) 142/294 Policy evaluation with FAs Task: given a policy π and FA V : Rn × Θ → R, find θ s.t. Vθ is “close” to vπ . “Closeness” can be expressed using various loss functions. Typically, we want to minimize the mean squared error (MSE): MSE(vπ , Vθ) = 1 2 Es∼µ (vπ (s) − Vθ(s))2 = 1 2 s∈S µ(s) · (vπ (s) − Vθ(s))2 , where µ is some distribution over states expressing how much do we care about errors in particular states. A local minimum of MSE can be found gradient descent: making successive step in the direction opposite to the gradient of MSE. 143/294 Recall: gradients Definition 51 Given a scalar function f (x1, . . . , xn, θ1, . . . , θm): Rn × Θ → R (where Θ ⊆ Rm ), the gradient of f w.r.t. parameters θ = (θ1, . . . , θm) is the vector function ∇θf = ( ∂f ∂θ1 , . . . , ∂f ∂θm ) of type Rn × Θ → Rm When f is a function approximator defined by a neural net, the value of the gradient ∇θf (x, θ) at a given point (x, θ) = (x1, . . . , xn, θ1, . . . , θm) can be computed by backpropagation (under some usual conditions like smoothness, etc.). 144/294 Gradient descent for policy evaluation To (locally) minimize MSE(vπ , Vθ), it suffices to perform (sufficiently small) steps in the negative direction of the current gradient, i.e., repeatedly perform updates: θ ← θ − α · ∇θMSE(vπ , Vθ) = θ − α · ∇θ 1 2 · Es∼µ vπ (s) − Vθ(s) 2 = θ − α 2 · Es∼µ ∇θ vπ (s) − Vθ(s) 2 = θ + α · Es∼µ vπ (s) − Vθ(s) · ∇θVθ(s) The expected value above is typically impossible to evaluate in practice. Instead we estimate it by samples ⇒ stochastic gradient descent. We typically take µ(s) representing the overall fraction of time spent in s when behaving according to µ. Hence, Es∼µ can be estimated by sampling a trajectory from µ and performing the update for each s on the trajectory in an every-visit fashion. 145/294 Stochastic gradient policy evaluation + MC instantiation We keep sampling trajectories τ from π: s0 a0 s1 a1 s2 a2 s3 a3 s4 a4 . . . r1 r2 r3 r4 r5 For each timestep t we perform the update of parameters θ ← θ + α · vπ (st) − Vθ(st) · ∇θVθ(st) . Problem: in policy evaluation setting, we do not know vπ (st). Hence, we estimate it using RL targets. The simplest is the Monte Carlo target: estimate st by the discounted return of the sampled trajectory from st, i.e. perform updates of the form θ ← θ + α · Gt(st) − Vθ(st) · ∇θVθ(st) . 146/294 Gradient Monte Carlo policy evaluation: pseudocode Algorithm 3: Gradient MC evaluation Input: Policy π, FA V : S × Θ → R, step size α Output: Approximation Vθ of vπ initialize θ arbitrarily; repeat sample trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . from π; foreach t ∈ {0, . . . 
, T − 1} do θ ← θ + α · [Gt(τ) − V (st, θ)] · ∇θV (st, θ) until timeout; 147/294 Semi-gradient TD(0) In the gradient update formula θ ← θ + α · vπ (st) − Vθ(st) · ∇θVθ(st) . we can also estimate vπ (st) with the TD(0) target: θ ← θ + α · rt+1 + γ · Vθ(st+1) − Vθ(st) · ∇θVθ(st) . This yield the semi-gradient TD(0) policy evaluation algorithm. Why semi-gradient? 148/294 Gradient vs. semi-gradient TD(0) Recall that our ultimate goal is to minimize MSE(vπ , Vθ) = 1 2 Es∼µ (vπ (s) − Vθ(s))2 . The gradient of this loss is ∇θ 1 2 Es∼µ (vπ (s) − Vθ(s))2 = 1 2 Es∼µ ∇θ(vπ (s) − Vθ(s))2 , Estimation with sample trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . and substituting vπ (s) with the TD(0) target would yield ∇θMSE(vπ , Vθ) ≈ 1 2 ∇θ rt+1+γVθ(st+1)−Vθ(st) 2 = (rt+1+γVθ(st+1)−Vθ(st))·(γ∇θVθ(st+1)−∇θVθ(st)), different update then semi-gradient TD(0)! However, this full gradient: • is more expensive to compute (2 backpropagations per update); • does not really express TD(0) idea (the update target is not fixed). 149/294 Semi-gradient TD(0): pseudocode Algorithm 4: Semi-gradient TD(0) evaluation Input: Policy π, FA V : S × Θ → R, step size α Output: Approximation Vθ of vπ initialize θ arbitrarily; repeat s ← initial state; a ∼ π(s); while s not terminal do s′ ∼ p(s, a); r ← r(s, a, s′ ); a′ ∼ π(s′ ); θ ← θ + α · [r + γ · V (s′ , θ) − V (s, θ)] · ∇θV (s, θ); s ← s′ ; a ← a′ until timeout; 150/294 On-policy control with function approximation Semi-gradient SARSA uses the same idea as TD(0), but with Q-approximator, i.e. Q : Rn × A × Θ → R. Behavior policy = e.g. ε-greedy with respect to the current Q. For a sampled trajectory τ = s0 a0 s1 a1 s2 a2 s3 a3 s4 a4 . . . r1 r2 r3 r4 r5 we perform, in each timestep t, an update θ ← θ + α · rt+1 + γ · Qθ(st+1, at+1) − Qθ(st, at) · ∇θQθ(st, at) . 151/294 Semi-gradient SARSA: pseudocode Algorithm 5: Semi-gradient SARSA Input: FA Q : S × A × Θ → R, step size α Output: Approximation Qθ of q∗ initialize θ arbitrarily; repeat s ← initial state; π ← policy ε-greedy w.r.t. Qθ; a ∼ π(s); while s not terminal do s′ ∼ p(s, a); r ← r(s, a, s′ ); a′ ∼ π(s′ ); θ ← θ + α · [r + γ · Qθ(s′ , a′ ) − Qθ(s, a)] · ∇θQθ(s, a); s ← s′ ; a ← a′ until timeout; 152/294 Representing actions in DNNs How to represent actions in the (say, DNN) function approximator Q is largely a domain-dependent engineering choice. If the set of actions A = {a1 , . . . , ak } is discrete and reasonably small, we can consider a net which inputs a state (i.e., n input neurons when S = Rn ) and outputs an |A|-dimensional vector (i.e., one output neuron per action), so that the output of the i-th neuron on input s is interpreted as Q(s, ai ). I.e., in such a case we consider Q to be function of type Q : S × Θ → R|A| . 153/294 On-policy semi-gradient methods: concluding remarks The presented algorithms can be instantiated also with other types of returns, such as: • n-step returns • n-step SARSA update: θ ← θ+α· rt+1 + γ · rt+2 + · · · + γn−1 rt+n + γn · Qθ(st+n, at+n) Gt:t+n,θ −Qθ(st , at ) ·∇θQθ(st , at ) • forward-view λ-returns • SARSA(λ) update: θ ← θ + α · (1 − λ) ∞ n=1 λn−1 Gt:t+n,θ − Qθ(st , at ) · ∇θQθ(st , at ) • backward-view λ-returns (optional reading: Sutton&Barto, sections 12.2 and 12.7) 154/294 Value-Based Off-Policy Control with Approximators: DQNs and Friends Off-policy methods with function approximation ...are tricky to get right, already in the case of policy evaluation. The training can become very unstable. 
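Before turning to the off-policy case, a minimal sketch of the semi-gradient TD(0) update from Algorithm 4, instantiated for a linear approximator (the helper name is illustrative, not part of the course code):

```python
import numpy as np

def semi_gradient_td0_step(theta, s, r, s_next, gamma, alpha, terminal=False):
    """One semi-gradient TD(0) update for a linear approximator V_theta(s) = theta . s.

    The bootstrap target r + gamma * V_theta(s') is treated as a constant;
    only V_theta(s) is differentiated -- hence 'semi'-gradient.
    """
    v_next = 0.0 if terminal else float(theta @ s_next)
    td_error = r + gamma * v_next - float(theta @ s)
    return theta + alpha * td_error * s   # grad_theta V_theta(s) = s for a linear FA
```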
For on-policy (semi)-gradient methods, one can typically prove convergence to correct/optimal values at least in the case of linear function approximation (though not in the more general case of NN approximators). Off-policy semi-gradient methods, such as: • TD with importance sampling (not covered here), or • Q-learning with function approximators (will be covered a bit later), can diverge already with linear function approximators. 155/294 Divergence examples (high-level) • Baird’s counterexample: semi-gradient TD with importance sampling can diverge in presence of linear FAs • Moreover, the divergence is not due to the instability of (semi)-gradient descent. Tsitsiklis and Van Roy’s counterexample shows divergence even in the case where each update completely replaces the current θ with the optimal θ∗ which minimizes the MSE between Vθ and the TD(0) update target. The problem lies in the off-policy distribution of updates. • Counterexamples explained in optional reading: Sutton&Barto, Sec. 11.2. 156/294 Deadly triad Identified by Sutton&Barto: risk of training instability and divergence steeply rises when combining: • function approximation, • bootstrapping, and • off-policy training. But often we want to do just that. :) Practical solution: Happily do the deadly triad, but use insights from supervised learning to develop additional techniques that help stabilize the training. 157/294 Deep Q-Networks (DQN) 2013 arXiv tech. report, there is also follow-up 2015 Nature paper 158/294 Q-learning with function approximators Same semi-gradient idea as in TD(0), SARSA: adjust θ to bring Qθ(s, a) closer to the fixed Q-learning update target. I.e., for a sampled trajectory s0 a0 s1 a1 s2 a2 s3 a3 s4 a4 . . . r1 r2 r3 r4 r5 and its timestep t, the update is θ ← θ + α · rt+1 + γ · max a∈A(st+1) Qθ(st+1, a) − Qθ(st, at) · ∇θQθ(st, at) . But performing the updates based solely on the current step would be susceptible to instability due to the presence of the deadly triad. 159/294 Deep Q-learning challenges From Mnih et al. Playing Atari with Deep Reinforcement Learning: 160/294 Experience replay Originated in the work of Long-Ji Lin, e.g.: Reinforcement Learning for Robots Using Neural Networks, dissertation, 1993. Definition 52: Experience An experience is a 4-tuple (s, a, r, s′ ) ∈ S × A × S × R interpreted as (”state”, ”action played in it”, ”reward obtained”, ”next state observed”). • DQN does not perform update based only on the current step. Instead, for each sampled trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . and each timestep t it: • first stores the one-step experience (st , at , rt+1, st+1) in a data structure B called replay buffer; • then, sample a random minibatch of experiences B ⊆ B of a given minibatch size Bsize • perform a minibatch-gradient-descent update w.r.t. B: compute the gradient of the Q-learning loss for each e ∈ B and then update θ in the direction of an average gradient over the whole B. 161/294 Minibatch update Fix a minibatch B. For each e = (s, a, r, s′ ) ∈ B we compute the gradient of the Q-learning loss ∇θL(θ, e) at point e: ∇θL(θ, e) = ∇θ 1 2 r + γ · max b∈A(s′) Qθ(s′ , b) fixed target − Qθ(s, a) 2 = r + γ · max b∈A(s′) Qθ(s′ , b) =0 if s terminal − Qθ(s, a) · ∇θQθ(s, a) We then perform an update in the direction of average gradient: θ ← θ + α · 1 |B| e∈B ∇θL(θ, e). 
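A minimal sketch of the replay buffer with uniform minibatch sampling described above (illustrative only; prioritization and the gradient computation itself are omitted):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of experiences (s, a, r, s_next, terminal)."""

    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)   # when full, the oldest experience is dropped

    def store(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size: int):
        # uniform sampling, as in the original 2013 DQN
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

# In the training loop, schematically:
#   buffer.store((s, a, r, s_next, terminal))
#   batch = buffer.sample(B_size)
#   theta <- theta + alpha * (1/|batch|) * sum of per-experience Q-learning gradients
```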
162/294 Experience replay rationale • helps decorrelate the DNN training data • helps to prevent catastrophic forgetting • improves data efficiency via experience re-use Why good match for deep Q-learning? Experience replay is by design off-policy since we train on old data, which were sampled from different policy than the current one. 163/294 Replay buffer implementation The replay buffer B is typically not unbounded, but has a fixed capacity Bsize. Replacement is eventually needed. If B is full, the oldest experience if removed (B = queue). How to sample the minibatches? • Original DQN: uniformly from B. • Alternative: prioritized experience replay: each experience is assigned a priority (several heuristics exist). An experience is sampled into a minibatch with probability proportional to its priority. 164/294 DQN: 2013 pseudocode Algorithm 6: DQN with replay buffer Input: Black-box MDP M = (S, A, p, r), approximator Q; hyperparam’s ε, Bsize, Bsize, . . . Output: Approximation Qθ of q∗ initialize θ arbitrarily; initialize empty replay buffer B of capacity Bsize; repeat s ← initial state; while s not terminal do π ← policy ε-greedy w.r.t. Qθ; a ∼ π(s); s′ ∼ p(s, a); r ← r(s, a, s′ ); store (s, a, r, s′ ) in B; sample a minibatch B of size Bsize from replay buffer B; perform the minibatch update θ ← θ + α · 1 Bsize e∈B ∇θL(θ, e) (see this slide); s ← s′ ; until timeout; 165/294 DQN: 2013 results Mnih et al. (2013, arXiv) 166/294 Target networks: another stabilizing factor in DQNs Introduced in the reviewed version of DQN paper: Mnih et al.: Human-level control through deep reinforcement learning. Nature, 518 (2015). Performing the (minibatch) Q-update looks like supervised learning: change θ so that Qθ(s, a) gets closer to the fixed target r + γ · max b∈A(s′) Qθ(s′ , b), where (s, a, r, s′ ) is the processed experience. Performing the (minibatch) Q-update looks like supervised learning, but: change θ so that Qθ(s, a) gets closer to the fixed target r + γ · max b∈A(s′) Qθ(s′ , b) the “label” changes with each update! , 167/294 Target networks: idea To stabilize learning, we use two networks: the main network, and the target network. They have the same architecture (denote it by Q), but their weights may differ during the execution of the algorithm. We denote: • θ - weights of main network • ˆθ - weights of target network Usage: • The target network is used only to compute TD targets when computing losses: ∇θL(θ, e) = r + γ · max b∈A(s′) Q(s′ , b, ˆθ) − Q(s, a, θ) · ∇θQ(s, a, θ) • At the start, and also in periodic intervals (but not after each update!) the two networks are synchronized by performing ˆθ ← θ. Other than this, ˆθ stays fixed, the gradient steps are only used to update θ (i.e., the main network). 168/294 DQN: 2015 pseudocode Algorithm 7: DQN with replay buffer and target network Input: Black-box MDP M = (S, A, p, r), approximator Q; hyperparam’s ε, Bsize, Bsize, C, . . . Output: Approximation Qθ of q∗ initialize θ arbitrarily; ˆθ ← θ; counter ← C; initialize empty replay buffer B of capacity Bsize; repeat s ← initial state; while s not terminal do if counter = 0 then ˆθ ← θ; counter ← C else counter ← counter − 1; π ← policy ε-greedy w.r.t. 
Qθ; a ∼ π(s); s′ ∼ p(s, a); r ← r(s, a, s′ ); store (s, a, r, s′ ) in B; sample a minibatch B of size Bsize from replay buffer B; perform the minibatch update θ ← θ + α · 1 Bsize e∈B ∇θL(θ, e), where ∇θL(θ, e) = r + γ · maxb∈A(s′) Q(s′ , b, ˆθ) − Q(s, a, θ) · ∇θQ(s, a, θ); s ← s′ ; until timeout; 169/294 DQN: 2015 results 170/294 Engineering behind DQN for Atari: partial observability True state = current state (program counter + variable values) of the game program. We do not see this – only the frames rendered on screen. Solving partially observable environments requires (per some POMDP theory) making decisions based on the whole history of observations. This is computationally demanding (recurrent NNs...). DQN for Atari solves this by feeding the last 4 observed frames into the NN. This is typically enough to deduce the dynamics of the current play. 171/294 DQN: dynamics from limited frame history 172/294 Engineering behind DQN for Atari: the network Inputs four 84x84px images, then 3 convolutional layers, then two fully connected layers. all with ReLU activations. Outputs Q-estimate for each action. source: Mnih et al. (Nature, 2015), details in appendix “Model architecture” 173/294 DQN for Atari: dirty engineering tricks • Preprocessing: 210x160 RGB color images are converted to grayscale and resized to 84x84 resolution. • Frame skipping: the agent only observes and acts in every K-th frame, for the frames in between, the last selected action is repeated without providing the frame to the agent. (In the paper, K = 4.) • Reward clipping: all positive one-step Atari rewards are clipped to +1, all negative ones are clipped to −1 (Atari gives integer rewards). • TD error clipping: for each update, the Q-learning error r + γ · max b ∈ A(s′ )Q(s′ , b) − Q(s, a) is clipped to [−1, 1]. 174/294 DQN for Atari: selected hyperparameters (nex) minibatch size Bsize 32 replay buffer size Bsize 1,000,000 target network update freqeuency C 10,000 discount factor γ 0.99 update frequency (steps between two minibatch updates) 4 learning rate 0.00025 initial ε 1 final ε (linear decay) 0.1 final decay frame 1,000,000 random policy played for init. 50,000 frames max. do-nothing actions at episode start 30 175/294 RAINBOW of heuristics Multitude of heuristics for the improvement of DQN were developed over time. Some of them make sense also in the context of other deep RL algorithms. The RAINBOW agent combines six such heuristics to further improve the DQN performance on Atari games. (In proceedings of AAAI 2018.) 176/294 Rainbow heuristics • dueling networks architecture • double DQN • prioritized experience replay • n-step rewards • distributional learning • noisy networks 177/294 Action advantage Idea: imagine that for some state s, the Q-values of all actions are high. Then s should be in some sense valuable in itself. Definition 53: Advantage Let π be a policy. An advantage function advπ : S × A → R is defined as advπ (s, a) = qπ (s, a) − vπ (s). Dueling architecture splits certain layers of the neural network into two “streams”, one estimating (somethign like) vπ (s) and one estimating (something like) advπ (s, a). The final layer combines these estimates to produce an estimate of qπ (s, a). 178/294 Dueling architecture Wang et al.: Dueling Network Architectures for Deep Reinforcement Learning. In proceedings of ICML’16. 
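A minimal sketch of the dueling decomposition, using the mean-advantage aggregator that Rainbow adopts (discussed on the following slides); the two head functions are placeholders for the network streams:

```python
import numpy as np

def dueling_q_values(features, value_head, advantage_head):
    """Combine the value stream and the advantage stream into Q-estimates:
    Q(s, a) = V(s) + A(s, a) - mean_b A(s, b)."""
    v = value_head(features)         # scalar state-value estimate
    adv = advantage_head(features)   # vector of per-action advantage estimates
    return v + adv - adv.mean()
```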
179/294 Dueling architecture: theory and training We have Qθ,α,β(s, a) = aggregate(Vθ,α(s), Aθ,β(s, a)), where • θ - convolutional (or other feature extraction) layer parameters • α - value channel parameters • β - advantage channel parameters The whole network Q is trained to estimate qπ (where π is the target policy) using any deep RL algorithm (e.g. DQN, in which case π is the optimal policy). There is nothing new from RL perspective here, all the novelty is inside the network. The factorization into state value and advantage is supposed to help the network “focus” on features that are important to recognize valuable states and features that help us rank actions. 180/294 Dueling architecture: Atari example From Wang et al.: Dueling Network Architectures for Deep Reinforcement Learning. In proceedings of ICML’16. 181/294 Learning state values and advantages • The whole net is trained end-to-end to predict qπ . • How do we ensure the value/advantage channels are trained to predict state values/advantages? • By a suitable choice of aggregator: • Qθ,α,β(s, a) = Vθ,α(s) + Aθ,β(s, a) does not work: e.g. Vθ,α could converge to constant 0 and Aθ,β to qπ . • Qθ,α,β(s, a) = Vθ,α(s) + Aθ,β(s, a) − max b∈A(s) Aθ,β(s, b), the training then indeed pushes Vθ,α to vπ and Aθ,β to advπ + c where c is some constant. Issues: not differentiable, update sensitive to the value of maximizing action changes. 182/294 Aggregation in Rainbow The point: aggregating layer should anchor the sum of the channels to same baseline value derived non-trivially from the advantages (if all advantages shift up/down, so should the baseline). Rainbow uses mean advantage baseline: Qθ,α,β(s, a) = Vθ,α(s) + Aθ,β(s, a) − 1 |A(s)| b∈A(s) Aθ,β(s, b) pushes the value channel to predict 1 |A(s)| a∈A(s) qπ (s, a). 183/294 Double DQN van Hasselt, Guez, Silver: Deep Reinforcement Learning with Double Q-learning. In proceedings of AAAI 2016. Similar idea to tabular Double Q-learning (use different estimates for selecting maximizing action in bootstrap and for evaluating the bootstrap), but instead of independently updated networks uses main and target networks. I.e., for experience (s, a, r, s′ ), the update is: θ ← θ + α · r + γ · Qˆθ s′ , arg max b∈A(s′) Qθ(s′ , b) − Qθ(s, a) ∇θQθ(s, a), where ˆθ is the parameter vector of the target network. 184/294 Prioritized experience replay (Schaul et al., ICLR’16) Each experience e = (s, a, r, s′ ) in the replay buffer is assigned a priority according to its TD-error pe = |r + γ · max b∈A(s′) Qˆθ(s′ , b) − Qθ(s, a)| + ε (ε > 0 ensures all priorities are positive). The probability of sampling an experience e from the buffer is set to pα e e′∈B pα e′ , where α > 0 is a hyperparameter controlling the degree of prioritization. Prioritization induces bias: the sampled experiences no longer follow the same distribution as sampled trajectories. We can correct this by using importance sampling during updates: θ ← θ + α · 1 |B| · 1 pα e β · r + γ · Qˆθ s′ , arg max b∈A(s′) Qθ(s′ , b) − Qθ(s, a) · ∇θQθ(s, a), where β > 0 determines the degree of IS correction (annealed to 1 during training). 185/294 n-step returns Self-explanatory, use n-step return with Q-learning bootstrap when computing TD target. How to combine with replay buffer? Each experience stores a single step. Solutions: • Store experiences in B sequentially, with each sampled experience, retrieve also the next n − 1 ones (up to episode termination). Requires careful implementation. 
• Naive: each element of B consists of n consecutive experiences (space inefficient). Given consecutive experiences (st, at, rt+1, st+1), (st+1, at+1, rt+2, st+2), . . . , ((st+n−1, at+n−1, rt+n, st+n)), perform update θ ← θ + α · rt+1 + γrt+2 + · · · + ·γn−1 rt+n + γn max b∈A(st+n) Qˆθ(st+n, b) − Qθ(st, at) ∇θQθ(st, at). 186/294 Distributional learning Very rough idea: instead of expected returns, predict (discretized) distribution of returns. Source: Bellemare, Dabney, Munos: A Distributional Perspective on Reinforcement Learning. In proceedings of ICML’17. We still optimize the expected value of the distribution, but the NN processes richer information (neurological inspiration). 187/294 Noisy nets Fortunato et al.: Noisy Networks for Exploration . In proceedings of ICRL’17. An alternative way of achieving exploration (without ε-greedy policies). Replaces linear layers y = W · x + b with noisy layers of the form y = (µw + σw ⊙ εw ) · x + µb + σb ⊙ εb, where matrices µw , σw and vectors µb, σb are learnable, matrix εw and vector εb consist of random noise, and ⊙ represents component-wise multiplication. The loss function of the DQN training is then encapsulated in expectation over the noise. Interesting point: the net can learn to adjust σ’s and thus the degree of exploration over time. 188/294 Rainbow: evaluation and ablations (Hessel et al., 2017) 189/294 Policy Gradient Methods Value-based vs. policy-based methods So far: focus on approximating q∗ via some parameterized estimate Qθ, policy the defined by Qθ, e.g. (ε-)-greedy... In policy gradient methods we work directly with some parameterized representation of a policy πθ, and update θ so as to improve some performance characteristic of πθ (e.g., expected return). In particular, the policy π can be represented by a function approximator πθ : S × Θ → D(A). 190/294 Standard NN + softmax representation πθ(a|s) = eh(s,a,θ) b∈A eh(s,b,θ) , where h(s, a, θ) is the logit (”preference”) of action a 191/294 General policy gradient scheme We want to find θ that maximizes some performance measure (or objective function) J(θ) of πθ. The obvious choice for J is the expected return: J(θ) = vπθ = Eπθ [G], thought some algorithms use a different surrogate objective function. The optimization problem max θ J(θ) can be (locally) solved using gradient ascent: repeatedly perform updates θ ← θ + α · ∇θJ(θ). It is thus necessary to compute or approximate ∇θJ(θ): this is the scope of various policy gradient theorems. 192/294 Gradient of expected return: possible version ∇θJ(θ) = ∇θEπθ [G] = 193/294 Gradient of expected return (cont’d) 194/294 Vanilla MC policy gradient Algorithm 8: Vanilla MC policy gradient Input: Black-box MDP M = (S, A, p, r), policy parametrization πθ, learning rate α Output: Approximation πθ of π∗ initialize θ arbitrarily; repeat s0 ∼ initial distribution; generate episode τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . , rT , sT using πθ; θ ← θ + α · G(τ) · i t=0 ∇θ log πθ(aj |sj ) until timeout; 195/294 Score function form 196/294 Step-wise gradient ∇θJ(θ) = ∇θEπθ [G] = ∇θEπθ [ ∞ t=0 γt Rt+1] 197/294 Step-wise gradient (cont’d) 198/294 REINFORCE: better MC policy gradient Algorithm 9: REINFORCE (Williams, 1992) Input: Black-box MDP M = (S, A, p, r), policy parametrization πθ, learning rate α Output: Approximation πθ of π∗ initialize θ arbitrarily; repeat s0 ∼ initial distribution; generate episode τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . 
, rT , sT using πθ; G ← 0; for t = T − 1 to 0 do G ← rt+1 + γ · G; θ ← θ + α · γt · G · ∇θ log πθ(at|st) until timeout; 199/294 Baseline in policy gradient Theorem 54: Baseline theorem Let bθ(s): S × θ → R be any function. Then for any t: Eπθ bθ(st) · ∇θlogπθ(at|st) = 0. As a consequence ∇θJ(θ) = Eπθ T t=0 γt · Gt · ∇θ log πθ(at|st) = = Eπθ T t=0 γt · Gt−bθ(st) · ∇θ log πθ(at|st) The gradient estimates using baseline have the same expectations as the standard REINFORCE estimate, but might have a lower variance if baseline selected correctly. 200/294 State-value baseline Good choice is b(s) := vπθ (s), reducing the estimate variance by correcting the return for a bias caused by being in a certain state. Problem: we do not know vπθ . Solution: Learn vπθ online using a separate function approximator V , e.g. via gradient (every-visit) Monte Carlo. 201/294 REINFORCE with a state-value baseline Algorithm 10: REINFORCE with baseline Input: Black-box MDP M = (S, A, p, r), policy parametrization πθ, value parametrization Vη, learning rates απ, αV for the two approximators Output: Approximation πθ of π∗ initialize θ and η arbitrarily; repeat s0 ∼ initial distribution; generate episode τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . , rT , sT using πθ; G ← 0; for t = T − 1 to 0 do G ← rt+1 + γ · G; η ← η + αV · G − Vη(st) · ∇ηVη(st); θ ← θ + απ · γt · G − Vη(st) · ∇θ log πθ(at|st) until timeout; 202/294 REINFORCE: baseline effect experiment source: Sutton&Barto, p. 330 203/294 Advantage estimation in policy gradient Comparing returns to some state-dependent baseline is reminiscent of what happened in the dueling network architecture. In particular, one can derive another form of the policy gradient, showing that ∇θJ(θ) ∝ Es∼π,a∼πθ [qπθ (s, a) · ∇θ log π(a|s)]. Inputting a state-value baseline yields ∇θJ(θ) ∝ Es∼π,a∼πθ [(qπθ (s, a) − vπ (s)) advπ(s,a) ·∇θ log π(a|s)]. Hence, policy-gradient-type algorithms often formulate the update of policy parameters in the form θ ← θ + α · At θ,η(st, at) · ∇θ log πθ(at|st), where At θ,η is some approximator called advantage estimate. E.g. in PG with baseline, At (θ,η) = γt · (Gt − Vη(st)). 204/294 Proof of Baseline theorem 205/294 Actor-critic: Policy gradient with bootstrapping Recall the REINFORCE-with-baseline update: θ ← θ + α · γt · (Gt − Vη(st)) · ∇θ log πθ(at|st) Gt is a possible source of variance: let’s remove it (at the cost of introducing bias) via bootstrapping! E.g. TD(0): θ ← θ + α · γt · (rt + γ · Vη(st+1) − Vη(st)) · ∇θ log πθ(at|st). Here, the value network Vη both estimates the baseline value of the current state, and (via boostrap) the quality of the played action: we call it a critic, while the policy network πθ is called an actor. Note: we can use the same bootstrap to update critic parameters. 
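All of these actor updates reduce to computing ∇θ log πθ(at|st). A minimal sketch for the softmax policy over action preferences, here assuming linear preferences h(s, a, θ) = θ · φ(s, a) for an illustrative feature map φ:

```python
import numpy as np

def softmax_policy(theta, phi):
    """pi_theta(.|s) for preferences h(s, a, theta) = theta . phi(s, a).
    `phi` has shape (|A|, d): one feature vector phi(s, a) per action."""
    h = phi @ theta
    h -= h.max()                       # for numerical stability
    e = np.exp(h)
    return e / e.sum()

def grad_log_pi(theta, phi, a):
    """grad_theta log pi_theta(a|s) = phi(s, a) - sum_b pi_theta(b|s) * phi(s, b)."""
    pi = softmax_policy(theta, phi)
    return phi[a] - pi @ phi

# REINFORCE-with-baseline actor update for timestep t, schematically:
#   theta <- theta + alpha_pi * gamma**t * (G_t - V_eta(s_t)) * grad_log_pi(theta, phi_t, a_t)
```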
206/294 Basic Actor-Critic (AC) algorithm: pseudocode Algorithm 11: One-step AC Input: Black-box MDP M = (S, A, p, r), policy parametrization πθ, value parametrization Vη, learning rates απ, αV for the two approximators Output: Approximation πθ of π∗ initialize θ and η arbitrarily; repeat s ∼ initial distribution; D ← 1; while s is not terminal do a ∼ πθ(s); s′ ∼ p(s, a); δ ← r(s, a, s′ ) + γ · Vη(s′ ) − Vη(s); η ← η + αV · δ · ∇ηVη(st); θ ← θ + απ · D · δ · ∇θ log πθ(at|st); D ← γ · D; s ← s′ until timeout; 207/294 AC: picture 208/294 Comments on AC • The algorithm presented on previous slide is just a basic variant: actor-critic framework covers a wide range of algorithms: soft Actor-Critic (SAC), A2C, A3C, deep deterministic policy gradient (DDPG), twin-delayed DDPG (TD3), PPO (next time),. . . 209/294 Varieties of policy gradient heuristics • Entropy regularization: add to the objective function a term representing the entropy of the policy: we prefer “more” randomized policies to encourage exploration (used e.g. in SAC). • JENTR (θ) = Eπ [G] + β · Es∼πθ [H(π(s))] • Off-policy training by using replay buffer (e.g. in SAC, DDPG, TD3). • Using parallel agents whose gradients are averaged for each update (A2C). • Using n-step (e.g. A2C) or λ-returns in bootstrap (PPO). • Using different J(θ) than just expected return (SAC, TRPO, PPO). Note that the above algorithms typically differ from “vanilla” AC in more aspects then presented above. We shall see soon on the case of PPO. 210/294 Taming Unstable Gradients with Trust Regions: TRPO, PPO Limitations of policy gradient methods • Basic policy gradient methods are prone to large variance of gradient estimates. • These can be mitigated to some degree, e.g. by using n-step or λ-returns in advantage estimation. • Even then, the methods can yield large updates which can destabilize the training. We are never sure the parameter updates will actually lead to an improvement of a policy: 211/294 Sensitivity of policy to parameters For two actions and π(a|s) = 1 1−e−θ a = a1 1 − 1 1−e−θ a = a2, we get Source: slides by Emma Brunskill, Lecture 6, https://web.stanford.edu/class/cs234/modules.html 212/294 Direct policy improvement via surrogate objectives • We will present algorithms that use different performance metric so as increase the chance of policy improvement on update. • Moreover, we will design the performance metric so that its gradient can be easily computed by an automated differentiation tool without the need to derive the formulas manually. 213/294 Overall structure of presented algorithms • The algorithms will generate a sequence of policies π1, π2, π3 such that each policy will be (likely) an improvement over the previous one. • Each step from πi to πi+1 entails finding a policy πi+1 that optimizes some performance metric Lπi that is dependent on the previous policy πi ! (Cf. standard policy gradient: the performance metric J evaluated only the current policy). • I.e. a run of such an algorithm entails solving (using gradient-based methods) multiple optimization problems: one per each policy update. • We will now focus on the single improvement step: we will denote by • π = πθ the previous policy (πi ) • π′ = πθ′ the new policy (πi+1) we seek. θ is treated as a constant, θ′ as variables! 214/294 Roles of advantages in policy improvement Theorem 55 Let θ, θ′ be two parameter vectors and π = πθ, π′ = πθ′ . Then J(θ′ ) − J(θ) = Eτ∼π′ ∞ t=0 γt · advπ (st, at) def =Lπ(π′) . 
215/294 Another loss surrogate To ensure that update from θ to θ′ is an improvement, we want to maximize Lπ(π′ ) = Eτ∼π′ ∞ t=0 γt · advπ (st, at) But we cannot sample from π′ , neither can we easily compute the gradient of the loss by automated differentiation. Trick: the loss function Lπ behaves similarly to the following loss function Lπ for all points π′ that are “close enough” to π: Lπ(π′ ) = Eτ∼π ∞ t=0 γt · advπ (st, at) · π′ (at|st) π(at|st) . 216/294 Closeness of Lπ and Lπ Lπ(π) = Eτ∼π′ ∞ t=0 γt · advπ (st, at) Lπ = Eτ∼π ∞ t=0 γt · advπ (st, at) · π′ (at|st) π(at|st) “Behaves similarly” can be formalized as follows: Theorem 56 It holds Lπ(π) = Lπ(π). Moreover the gradients ∇θ′ Lπ and ∇θ′ Lπ are equal at point π. Proof: the first part is trivial. The second part is technical and requires converting the expectation into expectation-over-states form, see Optional reading: Schulman, Levine, Abbeel, Jordan, Moritz: Trust Region Policy Optimization. In Proceedings of ICML’15. 217/294 Trust Region Methods Hence, the constrained optimization problems maximize Lπ(π′ ) subject to π′ close to π maximize Lπ(π′ ) subject to π′ close to π have approximately the same optimal solutions. We want to solve the first, and will proceed by solving the second. The set of π′ that are “close enough” to π is called a trust region. What “close enough” means differs among algorithms. Exact bounds on the error of the approximation in terms of KL divergence between π and π′ was given in Optional reading: Achiam, Held, Tamar, Abbeel: Constrained Policy Optimization. In Proceedings of ICML’17. 218/294 Optimizing the loss by sampling and automated differentiation Lπ = Eτ∼π ∞ t=0 γt · advπ (st, at) · π′ (at|st) π(at|st) We replace the true advantage advπ with an advantage estimator Aη (neural net, more on that later). Instead of optimizing the true loss Lπ, we optimize a sample loss Lπ: for a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . .: Lπ(π′ ) ≈ Lπ(π′ ) = ∞ t=0 γt · Aη(τ, t) · π′ (at|st) π(at|st) (More trajectories can be sampled, in which case we optimize the average sample loss over the trajectories.) We then let an automated gradient-based optimizer find π′ = πθ′ maximizing Lπ(π′ ). Note that the only term in Lπ that depends on the optimized parameters θ′ are the likelihood ratios! Hence, the gradient can be computed easily. 219/294 Trust region policy optimization (TRPO) Schulman, Levine, Abbeel, Jordan, Moritz: Trust Region Policy Optimization. In Proceedings of ICML’15. To perform update from π to π′ , theoretical TRPO: • samples a trajectory τ (or a batch of trajectories) from π • uses the trajectory to: • update the advantage estimator Aη (i.e., update its parameters η) • construct the loss Lπ • solves the optimization problem maximize Lπ(π′ ) − β · DKL(π, π′ ) JTRPO (DKL - KL divergence, β - suitable constant) via a black-box gradient-based optimizer. This update loop is performed until timeout. Practical TRPO makes several changes to individual steps of the above scheme. 220/294 Proximal Policy Optimization (PPO) 221/294 PPO vs TRPO PPO follows the same general scheme as TRPO with the following tweaks: • Explicitly specifies how the advantage should be estimated: generalized advantage estimation (essentialy, truncated offline λ-returns, see later). • Tweaks the loss function a bit - uses different way of ensuring that π′ is in the proximity of π. 
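For concreteness, a sketch of the sample surrogate loss assembled from likelihood ratios and fixed advantage estimates (in practice the ratios are produced by the policy network and the gradient w.r.t. θ′ is obtained by automatic differentiation):

```python
import numpy as np

def sample_surrogate_loss(ratios, advantages, gamma):
    """Empirical surrogate  sum_t gamma^t * A_eta(tau, t) * pi'(a_t|s_t) / pi(a_t|s_t)
    for one sampled trajectory; the advantage estimates are treated as constants."""
    discounts = gamma ** np.arange(len(ratios))
    return float(np.sum(discounts * np.asarray(advantages) * np.asarray(ratios)))
```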
PPO comes in two variants depending on how it tweaks the loss function: • Adaptive KL divergence penalty coefficient: like TRPO, but the coefficient β changes adaptively. Empirically does not perform as well as the second method: • Clipped likelihood ratios. Definition 57: Clipping The function CLIP is defined as follows: CLIP(x, a, b) =    x if a ≤ x ≤ b a if x < a b if b < x 222/294 Basic PPO empirical loss For a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . denote rt(π, π′ ) = π′ (at|st) π(at|st) The main component of PPO loss/performance metric is: LCLIP π (π′ ) = ∞ t=0 min Aη(τ, t) · rt(π, π′ ) , Aη(τ, t) · CLIP rt(π, π′ ), 1 − ε, 1 + ε (Note that no γt is included in the formula: discounting to some degree implicitly encompassed in the advantage estimation.) 223/294 Constraining improvements but not damages LCLIP π (π′ ) = ∞ t=0 min Aη(τ, t) · rt(π, π′ ) , Aη(τ, t) · CLIP rt(π, π′ ), 1 − ε, 1 + ε source: Schulman et al.: Proximal Policy Optimization Algorithms 224/294 Advantage estimation in PPO PPO uses generalized advantage estimation = an actor-critic framework using λ-return to estimate the q-value. Formally, the advantage estimator is built on a value network Vη : S × Θ → R. Let τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . be a trajectory. An n-step advantage from time step t is the quantity At:t+n η (τ) = rt+1 + γrt+2 + · · · + γn−1 rt+n + γn Vη(st+n) − Vη(st). For a parameter λ ∈ [0, 1], the generalized advantage estimate at time t is GAE(λ,t) η (τ) = T−t n=1 λn · At:t+n η (τ). For a trajectory τ, PPO puts Aη(τ, t) = GAEλ,t η (τ). 225/294 PPO loss extension Apart from LCLIP π , PPO loss consist of two additional terms: 1. The value network Vη and policy network πθ might share parameters (e.g. the same feature extraction layers). In this case, η and θ should be trained together: for a sampled trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . ., the PPO loss will incorporate the empirical value loss LV (η′ ) = T t=0 (Vη′ (st) − Gt(τ) target )2 (instead of sample return, the training target might be e.g. TD(0), or Aη(τ, t) + Vη(st): note that η in the target (current value of the parameter) is a constant). 2. Original PPO also used entropy regularization, adding sample entropy loss: Lentropy (π′ ) = − 1 T T−1 t=0 a π′ (a|st) · log π′ (a|st) 226/294 Total PPO loss Given a sampled trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . ., the current reference policy π, and hyperparameters β1, β2 ∈ R+ , the canonical PPO loss is: LPPO τ (θ′ , η′ ) = LCLIP π (π′ ) − β1 · LV (η′ ) + β2 · Lentropy (π′ ) = T t=0 min Aη(τ, t) · rt(π, π′ ) , Aη(τ, t) · CLIP rt(π, π′ ), 1 − ε, 1 + ε − β1 · T t=0 (Vη′ (st) − Gt(τ) target )2 + β2 · − 1 T T−1 t=0 a π′ (a|st) · log π′ (a|st). Important note: the advantage estimates Aη in LCLIP are evaluated before the loss is constructed and are treated as constants inside LCLIP ! The value network parameters are only considered variable inside the value loss. Usually, a batch of τ is sampled and average loss 1 |batch| · τ∈batch LPPO τ (θ′ , η′ ) is optimized using gradient-based optimizer (eg., ADAM) to yield a policy update. 227/294 PPO high-level pseudocode Algorithm 12: PPO Input: policy network πθ, value network Vη, hyperparameters λ, |B|, β1, β2,. . . 
initialize θ, η; repeat sample a batch B of trajectories from policy π = πθ; foreach τ ∈ B do foreach 0 ≤ t ≤ T − 1 do at(τ) ← GAEλ,t η (τ); define LCLIP τ (π′ ) = T t=0 min at(τ) · rt(π, π′ ) , at(τ) · CLIP(rt(π, π′ ), 1 − ε, 1 + ε) ; define LPPO τ (θ′ , η′ ) = LCLIP π,τ (π′ ) − β1 · LV (η′ ) + β2 · Lentropy (π′ ) define LPPO (θ′ , η′ ) = 1 |B| · τ∈B LPPO τ (θ′ , η′ ); use ADAM or other optimizer to find θ′ , η′ approximately maximizing LPPO (θ′ , η′ ); θ ← θ′ ; η ← η′ until timeout; 228/294 Exploration vs. Exploitation: A Systematic Approach EvE dilemma When you see a good move, look for a better one. Emanuel Lasker (1868-1941), 2nd World Chess Champion • In RL, the algorithms often need to balance exploration of the MDP state space with exploitation: focusing on behavior that worked well in the past. • Typical RL algorithms ensure exploration by ε-greediness, with ε-possibly annealed towards zero over the training. (Other options: noisy nets, entropy regularization,. . . ) • These are rather ad-hoc approaches: the setting of hyperparameters (annealing rate, entropy coefficient) is often highly domain-dependent. • EvE dilemma also appears in domains that are not typically modeled as MDPs: recommender systems, disease treatment plans, or games. • The EvE dilemma is systematically studied using the formalism of multi-armed bandits (MABs). 229/294 One-armed bandit 230/294 Multi-armed bandit (MAB) reward µ(a) arm a µ(b) arm b µ(c) arm c A MAB is given by: • a finite set of arms A; • for each arm a a reward distribution Da with mean µ(a) • the Da and µ(a) are unknown to the player! For simplicity, we will assume that rewards are from the interval [0,1]. 231/294 Dynamics of multi-armed bandits • Interaction proceeds in discrete time steps 1, 2, 3, . . . , T, where T is a termination time which might or might not be known to the player. • In each time step t, we choose some arm at ∈ A to pull and receive reward rt ∼ Dat . = like a one-state MDP with stochastic rewards, but we do not known the reward distributions Di in advance. Our goal is to maximize the expected accumulated reward E[ T t=1 rt] = T t=1 E[rt]. Clearly, the expected reward is maximized by pulling, in each step t, the arm a∗ with maximal mean reward µ∗ (we assume all arms have different mean rewards): µ∗ = max a∈A µ(a) a∗ = arg max a∈A µ(a) However, since we do not know Da’s and µ(a)’s, we cannot a priori determine which arm is optimal! We need to learn something about the reward distributions by exploring individual arms, while trying to maximize the accumulated returns. 232/294 Policies in MABs Definition 58: MAB policy A policy in a multi-armed bandit problem is a function π which to each history a1, r1, a2, r2, . . . , at, rt of actions and resulting rewards assigns a distribution over arms. The policies we will work with typically based their decisions on the following statistics of the history: for each arm a we keep: • Nt(a) – the number of times the arm a was pulled by time t • Rt(a) = 1≤i≤t rt · I(ai = a) – the total reward accumulated by pulling arm a up to time t • ˆµt(a) = Rt (a) Nt (a) – the empirical mean return of arm a Node that if Nt(a) → ∞ as t → ∞, then ˆµt(a) → µ(a). 
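A minimal sketch of a bandit with [0, 1]-valued rewards and the per-arm statistics Nt(a), Rt(a), and the empirical means (Bernoulli reward distributions are an assumption made only for this illustration):

```python
import numpy as np

class BernoulliBandit:
    """Arms with Bernoulli reward distributions; the means are hidden from the player."""

    def __init__(self, means):
        self._means = np.asarray(means)                    # mu(a), unknown to the player

    def pull(self, a: int) -> float:
        return float(np.random.rand() < self._means[a])    # reward in {0, 1}, a subset of [0, 1]

class ArmStats:
    """Running statistics N_t(a), R_t(a) and the empirical mean returns."""

    def __init__(self, n_arms: int):
        self.N = np.zeros(n_arms)
        self.R = np.zeros(n_arms)

    def update(self, a: int, r: float):
        self.N[a] += 1
        self.R[a] += r

    def empirical_means(self):
        return self.R / np.maximum(self.N, 1)              # 0 for arms never pulled
```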
233/294 Example: ε-greedy policy Given ε ∈ [0, 1], an ε-greedy policy selects arm at+1, for any t ≥ 0, as follows: • with probability ε it selects an arm uniformly at random • with probability 1 − ε it selects an arm a such that a = arg max b∈A ˆµt(b) Intuitively, this is sub-optimal policy: even for large t, when all ˆµt(b) should be relatively good approximations of true mean rewards, the policy still selects sub-optimal arms at a constant rate. How to formalize this issue? 234/294 Regret The regret of a policy at time T is the difference between the expected return of the policy and the return we would get by always pulling the optimal arm. Formally Definition 59: Regret The regret of a policy π at time T is the quantity Regretπ(T) = T · µ∗ − Eπ [ T t=1 rt]. Another form of writing the regret: Regretπ(T) = T · µ∗ − a∈A µ(a) · Eπ [NT (a)]. 235/294 Lower bound on the regret of ε-greedy policy We will show that ε-greedy policy has regret linear in T, i.e. asymptotically the worst possible one. Let b be any sub-optimal action. Denote ∆b = µ∗ − µb > 0. Since b has probability at least ε of being pulled in every step, the expected number of times b is pulled in T steps is at least T · ε. Hence, Regretε-greedy(T) = T · µ∗ − a∈A µ(a) · Eε-greedy [NT (a)] = a∈A Eε-greedy [NT (a)] · µ∗ − a∈A µ(a) · Eε-greedy [NT (a)] = a∈A Eε-greedy [NT (a)] · (µ∗ − µ(a)) ≥ Eε-greedy [NT (a)] · ∆b ≥ T · ε · ∆b. 236/294 Logarithmic regret We will now demonstrate a simple policy achieving logarithmic regret. The policy π we will construct requires the advance knowledge of the termination time T and of the the gap between the optimal and second-best arm: ∆min = min b̸=a∗ µ∗ − µ(b) Policy π proceeds in two phases: • Phase 1 (exploration): in the first T1 = log(T)·|A|·4 ∆2 min steps it deterministically cycles through all arms. I.e., if A = {a1 , a2 , . . . , ak }, then in step i it plays arm aj where j = i (mod k) + 1. • Phase 2 (exploitation) at timestep T1 + 1, π identifies action a with maximal empirical mean ˆµT1 (a) and keeps playing this action for the remaining T2 = T − T1 steps. 237/294 Upper-bounding the regret The total regret of π can be decomposed into regrets accumulated in the two phases: Regretπ(T) = T · µ∗ − T t=1 Eπ [rt] = T1 · µ∗ − T1 t=1 Eπ [rt] R1, exploration regret + T2 · µ∗ − T t=T1+1 Eπ [rt] R2, exploitation regret Clearly R1 ≤ 1 · T1 ∈ O(log(T)). R2 depends on whether π correctly classifies the optimal arm at timestep T1 + 1. • If yes, then R2 = 0. • If no, then R2 ≤ T2 ≤ T ∈ O(T). 238/294 Bounding the probability of misclassification We have R2 ≤ T · Pπ [Mis], where Mis is the event that a∗ does not have the maximal empirical mean after T1 steps. By union bound Pπ [Mis] ≤ b̸=a∗ Pπ [ˆµT1 (b) ≥ ˆµT1 (a∗ )]. Clearly, the second-best action, call it ˜a, has the highest likelihood of achieving higher empirical mean than a∗ . I.e., Pπ [Mis] ≤ (|A| − 1) · Pπ [ˆµT1 (˜a) ≥ ˆµT1 (a∗ )]. It thus suffices to bound the probability of ˜a overtaking a∗ in empirical mean. 239/294 Probability of empirical deviation µ(˜a) ˜a µ∗ a∗ For ˜a to overtake a∗ , at least one of the arms must have empirical mean after T1 steps at least ∆min 2 -away from their true mean. Hence, it suffices to bound the probability that an empirical mean deviates too much from the true mean. 240/294 Hoeffding’s inequality for large deviations Theorem 60: Hoeffding’s inequality Let D be a distribution taking values in [0, 1]. Let µ be the mean of D and let ˆµn = n i=1 xi , where each xi is an independent sample from D. 
Then, for any n ∈ N+ and any δ > 0: P |ˆµn − µ| ≥ δ ≤ 2e−2nδ2 Recall T1 ≈ log(T)·|A|·4 ∆2 min . Then, e.g. for a∗ : Pπ |µT1 (a∗ ) − µ∗ | ≥ ∆min 2 ≤ 2e−2 T1 |A| ∆2 min 4 ≈ 2e−2 log(T) = 2 T2 . Hence, Pπ [Mis] ≤ C · 1 T2 for some constant C. It follows that the exploitation regret R2 is ≤ T · Pπ [Mis] = C′ · 1 T for some constant C′ . 241/294 Final bound on regret of π The total regret of π is: Regretπ(T) = R1(T) ∈O(log(T)) + R2(T) ∈O(1) ∈ O(log(T)). The logarithmic regret is the best possible: Theorem 61: Lai and Robbins (”Asymptotically efficient adaptive allocation rules”, 1985) Any policy has regret in Ω(log(T)). The disadvantage of π is that it needs to know both T and ∆min in advance. Knowledge of T can be discarded by using ε-greedy policies with adaptive ε: if εt = min{1, |A| ∆2 min·t }, then the regret is logarithmic (see David Silver’s slides). But there is actually a policy yielding logarithmic regret without the advance knowledge of either T or ∆min. 242/294 Optimism in the face of uncertainty The idea is that under-explored arms should be explored more: they could be better than we currently think (and if not, exploring them should disprove that). We will seek optimistic estimates of µ(a) in the form of upper confidence bounds: we seek an empirical quantity Ut(a) such that with high probability µ(a) ≤ ˆµt(a) + Ut(a) UCBt (a) . The policy will always pick action maximizing UCBt. Moreover, Ut(a) should be as tight as possible given available information. In particular, it should hold that if Nt(a) ≤ Nt(b), then Ut(a) ≤ Ut(b). µ(a) a µ(b) b 243/294 One-sided Hoeffding’s inequality Theorem 62: Hoeffding’s inequality (one-sided) Let D be a distribution taking values in [0, 1]. Let µ be the mean of D and let ˆµn = n i=1 xi , where each xi is an independent sample from D. Then, for any n ∈ N+ and any δ > 0: P µ − ˆµn ≥ δ ≤ e−2nδ2 . Let 1 − p be some required confidence level. We want to find “tight” Ut(a) such that µ∗ ≥ ˆµt(a) + Ut(a) with probability at most p, i.e. P µ∗ − ˆµt(a) ≥ Ut(a) ≤ p. By (one-sided) Hoeffding: P µ∗ − ˆµt(a) ≥ Ut(a) ≤ e−2Nt (a)U2 t (a) . 244/294 Deriving UCB formula From the previous we have: P µ∗ − ˆµt(a) ≥ Ut(a) ≤ e−2Nt (a)U2 t (a) . Performing substitution p = e−2Nt (a)U2 t (a) we derive the expression for Ut(a): Ut(a) = log(1/p) 2Nt(a) For this Ut(a), it holds P[µ∗ − µt(a) ≥ Ut(a)] ≤ p, i.e. P µ∗ ≤ ˆµt(a) + Ut(a) ≥ 1 − p. We want to increase the confidence over time, i.e. p should be a function of t with p → 0 as t → ∞. The standard approach is to put p = 1 tc for a suitable constant c. Then Ut(a) = log(1/p) 2Nt(a) = c log(t) 2Nt(a) = c 2 log t Nt(a) = C · log t Nt(a) 245/294 Summary of UCB policy In each step t, the UCB policy selects the arm with the highest upper confidence bound, i.e. arm a such that a = arg max a∈A UCBt(a) = arg max a∈A ˆµt(a) + C · log t Nt(a) , where C is a hyperparameter known as exploration constant. It can be shown that for [0, 1]-valued rewards, choosing C = √ 2 suffices to achieve logarithmic regret Theorem 63 When the reward distributions are over the interval [0, 1], then for C = √ 2 it holds RegretUCB (T) ∈ O(log T). Proof in optional reading: Auer, Cesa-Bianchi, Fischer: Finite-time Analysis of the Multiarmed Bandit Problem. In Machine Learning (47), 2002. 246/294 Concluding remarks on MABs • UCB-like algorithms with logarithmic regret were developed also for more general cases (e.g. distributions with unbounded support, where typically some shape of the distribution or a known bound on its variance is known). 
• The MAB model introduced here can be generalized in many ways (Bayesian bandits, contextual bandits, adversarial bandits) - rich area of research and applications. Optional reading: Slivkins: Introduction to Multi-Armed Bandits (arXiv). • UCB-like approaches can be used also in RL, e.g. instead of ε-greedy policies use π(s) = arg max a∈A(s) Q(s, a) + C · log N(s) N(s, a) . • Most prominently, this idea is applied in the context of Monte-Carlo tree search algorithms. 247/294 Monte Carlo Tree Search Tree? What tree? 1 2 :2 1 2 :0 1 3 :1 2 3 :-3 ...... ...... ...... ...... ...... ...... ...... ...... ...... • Rooted tree with alternating state nodes and action nodes. • Action nodes equipped with rewards and probability distributions over children. We denote p(a) the probability distribution over children of a. • Policy π in the tree assigns to each state node a distribution over its children. • For simplicity, we consider discount factor γ = 1. 248/294 Trees vs. MDPs The trees we will consider can be considered as special type of MDPs. Due to lack of cycles, all policies can be considered memoryless in the tree. However, the tree MDPs capture full generality of MDPs due to MDP unfolding. Definition 64: Unfolding of MDP For an MDP M = (S, A, p, r), the unfolding of M is a (possibly infinite) tree-shaped MDP Unfold(M) such that: • the states nodes of Unfold(M) are the histories of M (=trajectory prefixes); • if a is an action node that is a child of state node corresponding to history h, then in Unfold(M) the probability of child h, a, s of a is equal to p(s|last(h), a); • the reward function lifted similarly: rUnfold(M)(h, a, (h, a, s)) = rM(last(h), a, s) 249/294 Advantage of tree representation The tree unfolding of an MDP implicitly carries, in each node, the information about the whole history up to the reaching of that node. This can be advantageous when dealing with tasks/environments where the knowledge of history is important, e.g.: • partially observable environments (cf. Atari) • tasks with more complex objectives than maximizing the expected discounted return • adversarial environments (games) - requires distinguishing between player 1 and player 2 nodes (we will see later) In what follows, we will consider trees encoding episodic MDP tasks: the tree is potentially infinite, but the probability of all infinite branches is zero. (For the sake of concreteness, imagine chess where our opponent is playing some fixed known, possibly randomized, strategy.) 250/294 MCTS requirements The tree in which MCTS operates can be either given explicitly (if finite and small enough) or represented by some black-box model which allows for the following: • given a state node node, return a list of all action-node children of node; • given an action node anode, return a sample from p(anode), i.e. sample a child of anode according to the probability distribution specified by the tree; • given an action node anode and its child node, return the reward labeling the edge between the two nodes. The black-box can have a form of a finite explicitly represented MDP, or (in turn) of the sampling-model of such an MDP (as in classical RL). 251/294 Monte Carlo tree search – high-level properties Monte Carlo tree search (MCTS) is an online algorithm for solving (=maximizing the expected return in) tree-shaped MDPs. 
Online = it does not compute a complete policy for the whole MDP; instead given the current node = (current state and the history of states and actions visited so far,) the algorithm computes the (approximately) best action to play in the current step. Terminology: one step of MCTS = performing a computation to determine the best action given the current situation. MCTS can be employed over multiple steps, until a whole trajectory is generated. Thus, MCTS can be seen as an algorithmic representation of a policy. MCTS iteratively builds a structure called search tree, which is a finite sub-tree of the original (possibly infinite) MDP tree. The search tree is a global data structure shared across the steps. 252/294 MCTS master loop Algorithm 13: MCTS master loop Input: Tree-shaped MDP M with root s0 node ← s0 T ← tree with single node (root) s0 and all its child action nodes while node is not terminal do BuildTree(T ) // Modifies T a ← ActualSelect(T ) node′ ∼ p(a) node ← node′ T ← sub-tree of T rooted in node // If node not in T , this is a single-node tree with node as root. 253/294 Building the search tree Building the search tree also involves sampling trajectories from the environment. This is standard in RL: the algorithms we have seen so far were sampling from the environment to compute policies, and the policies themselves can be seen as prescriptions for interacting with (=sampling from) the environment. In contrast, planning/MCTS literature often differentiates between • the actual online actions played by the algorithm when interacting with the “true” environment, i.e. outcomes of ActualSelect function (e.g. moving a chess piece on physical a board); and • simulated actions performed when building the tree (e.g. imagining outcomes of various chess moves when thinking about the next move) The reason for the distinction is that in practice, we might want the simulated actions to be performed in a virtual model of the environment (for sake of sample efficiency), while the actual actions in the “true” environment. We will stick to the nomenclature simulated/actual action to distinguish between actions played during/at the end of individual steps. 254/294 Search tree statistics The BuildTree function maintains several statistics of each node node of the tree (both state and action node): • N(node) – the visit count of node how many times has the node been visited during the run of the algorithm • R(node) – the total return achieved from that node during the run of the algorithm (counting only rewards collected on the paths from node to a leaf) • V (node) = R(node) V (node) – the estimated value of the node (i.e., the estimate of best expected return achievable from the node; for action nodes this includes the immediate reward of that action). Like the whole search tree T , these statistics are global to the whole MCTS algorithm, and are typically not reset between BuildTree calls. The ActualSelect function also uses these statistics. Invariant: all state nodes in T have all there child action nodes also in T 255/294 Building the search three: four phases Input: Tree-shaped MDP M with root s0 node ← s0 T ← tree with single state node (root) s0 and all its (action node) children while node is not terminal do BuildTree(T ) a ← ActualSelect(T ) node′ ∼ p(a) node ← node′ T ← sub-tree of T rooted in node // If node not in T , this is a single-node tree with node as root. Tree building proceeds by repeatedly performing these 4 phases: 1. Search. 2. Expansion. 3. Rollout. 4. Backup. 
The phases are repeated until a predetermined timeout. 256/294 Tree building Algorithm 14: Tree building Procedure BuildTree(T ): repeat (node, anode) ← Search(T ) // Traverse top-down to the ‘‘best’’ node // not in T . Expand(T , node, anode) // Add the discovered node to T . (R, a) ← Rollout(node) // MC estimate of node’s value // via default policy. Backup(T , R, a) // Update statistics on branch from root to node. until timeout 257/294 Search phase • Traverse T from the root to the most promising action leaf. Then sample a state node child of this leaf (not yet in tree). • Traverse = in each state node, select an action and in each action node, sample a successor state node from the transition function of the environment. • Hence on each level, we also need to select “most promising” action in the current node. • Most promising = with best value estimate, but we also take exploration/exploitation dilemma into account. 258/294 Search phase 2 • We treat the decision in each concrete node as a separate multi-armed bandit problem. That is, when traversing a state node s, we select an action node a such that a = arg max a child of s V (a) + C · log N(s) N(a) , where C is an exploration constant. • I.e., when sampling, while in the tree we follow a UCB behavior policy. This approach is called upper confidence bound on trees (UCT), first proposed in optional reading: Kocsis, Szepesv´ari: Bandit-Based Monte Carlo Planning. In proceedings of ECML 2006. 259/294 Search phase: pseudocode Algorithm 15: Search phase Function Search(T ): node ← root of T repeat anode ← arg max a child of node V (a) + C · log N(node) N(a) node ∼ p(anode) until anode is a leaf return node, anode 260/294 Expansion phase Algorithm 16: Expansion phase Procedure Expand(T , node, anode): add node to T as a child of anode N(node) ← 0 R(node) ← 0 V (node) ← 0 // Add all action-node children of node foreach child a of node do add a to T as a child of node N(a) ← 0 R(a) ← 0 V (a) ← 0 (Same initialization used in root.) 261/294 Rollout phase • Perform a Monte Carlo sample of the rest of the trajectory from the newly added node. Record the return obtained. • Since we are now outside of the tree, UCT cannot be used (no statistics). Instead, we follow some fixed behavior policy called default policy. • Typical choice of default policy: in each node, select uniformly from all actions. • Domain knowledge can be used to craft more intricate default policies. 
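Tying the UCT rule of the search phase to code, a minimal sketch of the selection step over the child action nodes (node statistics are kept in dictionaries; names are illustrative):

```python
import math

def uct_select(children, N_parent, V, N, C=math.sqrt(2)):
    """Pick the child action node maximizing V(a) + C * sqrt(log N(parent) / N(a));
    C is the exploration constant, and unvisited children are tried first."""
    def score(a):
        if N[a] == 0:
            return float("inf")
        return V[a] + C * math.sqrt(math.log(N_parent) / N[a])

    return max(children, key=score)
```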
262/294 Rollout phase: pseudocode anode Algorithm 17: Rollout with uniform default policy Function Rollout(node): R ← 0 while node not terminal do sample a uniformly from all children of node if this is the first iteration of the loop then anode ← a node′ ∼ p(a) R ← R + r(node, a, node′ ) node ← node′ return R, anode 263/294 Backup phase R anode For all nodes currently in T traversed in the previous phases, update their statistics using the rollout outcome: Algorithm 18: Backup Procedure Backup(T , anode, R): n ← anode while true do N(n) ← N(n) + 1 R(n) ← R(n) + R V (n) ← R(n)/N(n) if n is the root then break if n is a state node then R ← R + the reward of the edge connecting n to its parent n ← parent(n) 264/294 MCTS Tree building summary Procedure BuildTree(T ): repeat (node, anode) ← Search(T ) Expand(T , node, anode) (R, a) ← Rollout(node) Backup(T , R, a) until timeout 265/294 MCTS: Actual action selection Input: Tree-shaped MDP M with root s0 node ← s0 T ← tree with single state node (root) s0 and all its (action node) children while node is not terminal do BuildTree(T ) a ← ActualSelect(T ) node′ ∼ p(a) node ← node′ T ← sub-tree of T rooted in node // If node not in T , this is a single-node tree with node as root. • Usually just greedy according to value estimates in the root: a = arg max a child of root(T ) V (a) . • Alternative: according to visit count: a = arg max a child of root(T ) N(a), or a ∼ softmax(Na)a child of root(T ) (both used in AlphaZero). 266/294 MCTS with Function Approximators: AlphaZero Towards AlphaZero • AlphaGo Lee (2016): uses lots of domain knowledge, won 4-1 over 9-dan Go champion Lee Sedol • AlphaGo Zero (2017): zero domain knowledge • AlphaZero (2017-2018): general game-playing (and MDP solving) algorithm 267/294 AlphaZero literature AlphaGo Zero: Silver et al.: Mastering the game of Go without human knowledge. In Nature, vol. 550 (2017). • Focuses on Go, but has a rich methodology section explaining design details. AlphaZero: Silver et al.: A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. In Science, vol. 362 (2018). • More general, but many designed choices already explained in the AlphaGo paper not covered: need to look into code for details. 268/294 MDP vs. game trees 1 2 :2 1 2 :0 1 3 :1 2 3 :-3 ...... ...... ...... ...... ...... ...... ...... ...... ...... • State nodes alternate between maximizer and minimizer nodes. Maximizer/minimizer player wants to maximize/minimize the expected return achieved in its subtree. For chess-type games, targeted by AlphaZero, simpler models suffice: • Rewards only in terminal states (-1,0,+1). • Deterministic action, each action node has only one successor. In figures, we will sometimes omit action nodes, but pseudocodes will refer to the most general formulation. 269/294 AlphaZero actual play s s′ s′′ ...... ...... ...... ...... • We are in some state node (game position) s. • We call AlphaZero to suggest an action to play. • We play the action in the game and observe the next state s′ . • We wait for the oponent to play an action in his state and observe the next state s′′ . • Repeat. 270/294 AlphaZero agent ...... ...... ...... ...... 
Algorithm 19: AlphaZero single game Input: Game tree with root s0, param’s θ Function AlphaZeroAgent(s0, θ): node ← s0 T ← tree with single node (root) s0 and all its child action nodes while node is not terminal do BuildTree(T , θ) // Modifies T a ← ActualSelect(T ) node′ ∼ p(a) oponent responds by playing a′ node′′ ∼ p(a) node ← node′′ T ← sub-tree of T rooted in node 271/294 AlphaZero = MCTS + function approximation • The BuildTree() function of AlphaZero replaces rollouts with an evaluation of newly discovered nodes via a function approximator. • The function approximator is trained on data generate by repeated plays in the actual environment (i.e., repeated calls of the AlphaZeroAgent() procedure.) AlphaZeroAgent(s0, θ) play Train(θ, τ) generate trajectory τ updated θ Slightly inaccurate AlphaZero training diagram (see later). 272/294 AlphaZero tree & approximator Search tree of AlphaZero differs from plain MCTS in that each action node anode carries an additional attribute: prior probability Pr(anode). Given a state node node, AlphaZero’s approximator predicts these quantities: • πθ(node) ∈ R|A| : the prior probability vector, assigning a probability to each action in node; • vθ(node): the value estimate of node The training will push θ so that • πθ better approximates the policy played by AlphaZero in the actual game; • vθ better approximates the expected return of AlphaZero in the actual game. However, AlphaZero does not play according to πθ: it uses the MCTS mechanism to improve upon πθ. Hence, AlphaZero is often presented as a MCTS-based policy improvement scheme. 273/294 AlphaZero: Tree building Procedure BuildTree(T , θ): AddNoise(T ) repeat (node, anode) ← Search(T ) // Search phase R ← ExpandPredict(T , node, anode, θ) // Expansion + prediction Backup(T , R, node) until timeout πθ, vθ 274/294 AlphaZero: Search phase (theory) Algorithm 20: Search phase Function Search(T ): repeat if node is maximizer’s then sgn ← 1 else sgn ← −1 node ← root of T anode ← arg max a child of node V (a)·sgn+C·Pr(a)· log N(node) N(a) node ∼ p(anode) until anode is a leaf return node, anode Note the self-play implemented by sgn and the modulation of exploration via prior probabilities. 275/294 AlphaZero: Search phase (DeepMind implementation) Algorithm 21: Search phase Function Search(T ): repeat if node is maximizer’s then sgn ← 1 else sgn ← −1 node ← root of T anode ← arg max a child of node V (a)·sgn+C(node)·Pr(a)· N(node) N(a) + 1 node ∼ p(anode) until anode is a leaf return node, anode where C(node) = log (+N(node)+cbase cbase + cinit with cinit = 1.25 and cbase = 19652. 276/294 AlphaZero: Expansion & Prediction πθ, vθ Algorithm 22: Expansion phase Procedure ExpandPredict(T , node, anode, θ): add node to T as a child of anode N(node) ← 0 R(node) ← 0 V (node) ← 0 // Add all action-node children of node foreach child a of node do add a to T as a child of node N(a) ← 0 R(a) ← 0 V (a) ← 0 Pr(a) ← πθ(a) return vθ(node) (Same initialization used in root.) 
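A sketch of the selection score used above. The reconstruction C(node) = log((1 + N(node) + c_base)/c_base) + c_init and the √N(node)/(1 + N(a)) exploration term follow the published AlphaZero pseudocode and should be read as an assumption here, with c_init = 1.25 and c_base = 19652 as on the slide:

```python
import math

def puct_score(V_a, N_a, prior_a, N_parent, sgn, c_init=1.25, c_base=19652):
    """AlphaZero-style score of an action node a under its parent state node:
        sgn * V(a) + C(parent) * Pr(a) * sqrt(N(parent)) / (1 + N(a)),
    where sgn is +1 in maximizer nodes and -1 in minimizer nodes (self-play)."""
    c = math.log((1 + N_parent + c_base) / c_base) + c_init
    return sgn * V_a + c * prior_a * math.sqrt(N_parent) / (1 + N_a)
```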
277/294 AlphaZero: Backup
For all nodes currently in T traversed in the previous phases, update their statistics using the predicted outcome:
Algorithm 23: Backup
Procedure Backup(T, R, node):
    n ← node
    while true do
        N(n) ← N(n) + 1
        R(n) ← R(n) + R
        V(n) ← R(n)/N(n)
        if n is the root then break
        if n is a state node then
            R ← R + the reward of the edge connecting n to its parent
        n ← parent(n)

278/294 Dirichlet noise for root exploration
Procedure BuildTree(T, θ):
    AddNoise(T)
    repeat
        (node, anode) ← Search(T)                  // Search phase
        R ← ExpandPredict(T, node, anode, θ)       // Expansion + prediction
        Backup(T, R, node)
    until timeout
Before the simulations start, we add Dirichlet noise to the prior probabilities in the root node, encouraging exploration. (Dirichlet distribution = a "distribution over discrete distributions".)
I.e., if the root has k child action nodes, we sample a vector ν ∼ Dirichlet(α⃗) and for each action child a of the root we perform:
    Pr(a) ← (1 − ε) · Pr(a) + ε · ν(a),
where α⃗ ∈ R^k_{>0} and ε ∈ (0, 1) are hyperparameters.

279/294 AlphaZero: Actual action selection
Algorithm 24: AlphaZero single game
Input: Game tree with root s0, parameters θ
Function AlphaZeroAgent(s0, θ):
    node ← s0
    T ← tree with single node (root) s0 and all its child action nodes
    while node is not terminal do
        BuildTree(T, θ)   // Modifies T
        a ← ActualSelect(T)
        node′ ∼ p(a)
        opponent responds by playing a′
        node′′ ∼ p(a′)
        node ← node′′
        T ← sub-tree of T rooted in node
The action in actual play is determined by visit counts: a = arg max_{a child of root(T)} N(a), or a ∼ softmax((N(a))_{a child of root(T)}).
The first (greedy) approach is typically used in deployment (e.g. competitive play) or later in training, while softmax is typically used early in training (with the temperature annealed over time to decrease exploration).

280/294 AlphaZero training
[diagram (slightly inaccurate, see later): AlphaZeroAgent(s0, θ) plays and generates a trajectory τ; Train(θ, τ) produces an updated θ]
[diagram (actual scheme): AlphaZeroAgent(s′0, θ) plays and pushes the generated trajectory τ as experience into a Buffer; Train(θ, B) samples experience minibatches from the Buffer and updates θ]

281/294 AlphaZero training details
A play of AlphaZeroAgent(s0, θ) produces a trajectory s0, a0, r1, s1, a1, ..., rT, sT. For each st we also store its statistics and the statistics of its child action nodes.
• For each timestep t:
    • Compute the total return Gt = Σ_{i=t+1}^{T} ri.
    • Compute the target policy πt s.t. πt(a) = N(a) / Σ_{b child of st} N(b).
    • If st is a maximizer state, add (st, πt, Gt) to the buffer.
    • If st is a minimizer state, add (st, πt, −Gt) to the buffer.
• Given a sampled experience e = (s, π, g), θ is updated so as to minimize the loss
    L(θ, e) = (vθ(s) − g)² − Σ_a π(a) · log πθ(s)(a) + c · ||θ||².
We typically perform minibatch updates akin to DQNs (average the updates over a number of sampled experiences). (A small Python sketch of this target construction and the root noise follows the next slide.)

282/294 State representation: chess example
The function approximator consists of feature-extraction layers followed by value and policy heads. The approximator is fed certain information carried by each state node. For chess, each node contains the k last positions along its history. Each position is represented as a set of 8 × 8 feature planes. Most of them represent the positions of pieces, with one plane for each piece type of each player.
[figure: an example chess position and three of its 8 × 8 feature planes — the king plane (K), the bishop plane (B), and a pawn plane (p); each plane has a 1 exactly at the squares occupied by the corresponding pieces]
Additional planes encode castling rights, repetitions, en-passant availability, etc.
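The following minimal Python sketch illustrates the root Dirichlet noise (slide 278) and the construction of training targets and the per-experience loss (slide 281). It is an illustration under assumed conventions only: the helper names (`add_root_noise`, `make_targets`, `loss`) and the buffer format are hypothetical, not taken from the slides or from DeepMind's code.

```python
import numpy as np

def add_root_noise(priors, alpha=0.3, eps=0.25):
    """Mix Dirichlet noise into the root priors: Pr(a) <- (1-eps)*Pr(a) + eps*nu(a)."""
    nu = np.random.dirichlet([alpha] * len(priors))
    return (1.0 - eps) * np.asarray(priors) + eps * nu

def make_targets(states, rewards, visit_counts, maximizer_mask):
    """Turn one self-play game into (state, target policy, signed return) triples.

    rewards[t] is r_{t+1}; visit_counts[t] are the root's child visit counts N(a)
    at time t; maximizer_mask[t] says whether s_t is a maximizer state.
    """
    T = len(rewards)
    targets = []
    for t, s in enumerate(states[:T]):
        G_t = sum(rewards[t:])                     # G_t = sum_{i=t+1}^{T} r_i
        pi_t = np.asarray(visit_counts[t], dtype=float)
        pi_t /= pi_t.sum()                         # pi_t(a) = N(a) / sum_b N(b)
        g = G_t if maximizer_mask[t] else -G_t     # minimizer states store -G_t
        targets.append((s, pi_t, g))
    return targets

def loss(v_pred, policy_pred, g, pi_target, theta_sq_norm, c=1e-4):
    """L = (v - g)^2 - sum_a pi(a) log pi_theta(a) + c*||theta||^2, for one experience."""
    cross_entropy = -float(np.sum(pi_target * np.log(policy_pred + 1e-12)))
    return (v_pred - g) ** 2 + cross_entropy + c * theta_sq_norm
```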
283/294 Action representation: chess example
The policy head outputs 4,672 action logits from which we derive the distribution over moves by applying softmax. Each move is determined by:
• the position of the moved piece (8 · 8 = 64 possibilities),
• 73 possibilities of what to do with the piece:
    • 7 · 8 = 56 "standard moves": a choice of 8 directions (N, NW, W, ..., NE) and advancing by 1–7 squares in the chosen direction,
    • 8 "knight jumps": one of the 8 possible L-shaped jumps,
    • 9 possibilities for pawn underpromotion (to knight, bishop, or rook; each either via a forward move or via one of the two diagonal captures).
Total: 64 · 73 = 4,672 moves. Illegal moves are masked out and the remaining ones renormalized.

284/294 AlphaZero approximator architecture for chess
Body + value and policy heads.
Body:
• Initial convolutional layer with ReLU nonlinearity and batch normalization.
• Followed by 19 residual blocks, each block with 2 convolutional layers (again with ReLU and batch norm.) and a skip connection around them.
• All the above conv. layers apply 256 filters with 3 × 3 kernel size and stride 1.
Value head:
• Single-filter unit-kernel conv. layer with stride 1 (+ ReLU and batch norm.).
• Then a fully connected layer with 256 neurons and ReLU.
• Then a final tanh layer of size 1.
Policy head:
• One more conv. layer as above (incl. ReLU and batch norm.).
• Then a final conv. layer of 73 unit-kernel(?) filters.

285/294 AlphaZero experiments (from the Silver et al. Science paper)
[results figures from the paper not reproduced here]
AZ trained for 700,000 steps, ∼ 8 hours for chess, on 5,000 TPUs.
Note: this was the 2018 version of Stockfish, without NN evaluation. Modern versions of Stockfish use NN evaluation and typically outperform open-source AlphaZero-based chess agents (e.g. LeelaChessZero).

286/294 Limitations of AlphaZero
• MCTS simulations need a (software) simulator of the environment (at the least, we need to be able to sample transitions from an arbitrary given state). Such a simulator might not always be available (the actual environment may be too complex, its dynamics unknown (Atari), ...).
• Moreover, AlphaZero is best suited for "discrete enough" domains (such as chess). Working with frame-like states of Atari is impractical. (E.g., how do we check whether a given frame is already included in the search tree?)
The MuZero algorithm solves both issues by learning a deep model of the environment, including a suitable encoding of states.
Schrittwieser et al.: Mastering Atari, Go, chess and shogi by planning with a learned model. In Nature, vol. 588 (2020).

287/294 MuZero: Main idea
The original MuZero works only for deterministic environments. The high-level structure is similar to AlphaZero. However, the nodes in MCTS simulations are formed by elements of a fixed latent space LS (typically some low-dimensional vector space).
From the actual plays, the algorithm trains the following networks:
• the representation function gθ: S → LS,
• the dynamics function hθ: LS × A → LS × R,
• the value and policy approximators vθ: LS → R and πθ: LS → D(A) (with the same role as in AlphaZero).
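To make the role of the three networks concrete, here is a schematic Python sketch of a single MuZero-style simulation in latent space. It is only an illustration of the idea: the callables `g_theta`, `h_theta`, `v_theta`, `pi_theta` and the `select`/`backup` helpers are hypothetical stand-ins with the signatures listed above, not the actual MuZero code.

```python
class LatentNode:
    """Tree node whose 'state' is an element of the latent space LS."""
    def __init__(self, latent=None, prior=1.0):
        self.latent, self.prior, self.reward = latent, prior, 0.0
        self.children = {}          # action -> LatentNode
        self.N, self.V = 0, 0.0

def expand(node, pi_theta):
    """Attach one child per action; the child's latent state is filled in later by h_theta."""
    for a, p in enumerate(pi_theta(node.latent)):
        node.children[a] = LatentNode(prior=p)

def simulate_once(root_state, g_theta, h_theta, v_theta, pi_theta, select, backup):
    """One MCTS simulation: embed the real state, descend, unroll the learned dynamics, back up."""
    root = LatentNode(latent=g_theta(root_state))           # s~ = g_theta(s)
    expand(root, pi_theta)
    path, node, a = [root], root, None
    while node.children:                                     # descend with the usual (PUCT-style) rule
        a, node = select(node)
        path.append(node)
    parent = path[-2]
    node.latent, node.reward = h_theta(parent.latent, a)    # learned dynamics replaces the simulator
    expand(node, pi_theta)
    backup(path, v_theta(node.latent))                      # back up the predicted value along the path
    return root
```

A real BuildTree would of course keep the root and its statistics across simulations; the sketch re-creates the root only to keep the example short.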
288/294 MuZero: High-level picture
Apart from training (see the next slide for intuition), the "master loop" of a MuZero agent looks similar to AlphaZero's. During actual play:
• In the current state, perform MCTS simulations to determine the best action.
• Play the action.
• Observe the reward and the new state in the actual environment.
• Repeat.
The major difference is in the "MCTS simulation" part, which is performed in the latent space.

289/294 MuZero training intuition (slide by Richard Schwarz)

290/294 MCTS in latent space
Given the current actual state s:
• Embed s into the latent space to get the latent state s̃ = gθ(s).
• Using the dynamics function hθ, build the MCTS search tree from s̃ using the usual UCT approach. The nodes of the tree are elements of the latent space! (The approximators vθ, πθ are also used during the build, as in normal AlphaZero.)
• Use the statistics of the root to select the best action to play in the actual game.

291/294 MuZero experiments
Source: Schrittwieser et al.: Mastering Atari, Go, chess and shogi by planning with a learned model.

292/294 Further developments of MuZero and MCTS
• Extension to stochastic environments: Antonoglou, Schrittwieser, Ozair, Hubert, Silver: Planning in Stochastic Environments with a Learned Model. In Proceedings of ICLR 2022.
• Tricks to increase sample efficiency (e.g. GumbelZero).
• ...
MCTS has formed the basis of practical algorithms in various domains, e.g.:
• AlphaChip
• AlphaGeometry

293/294 Glimpse of the future

294/294