Credit assignment problem

Imagine playing a game such as chess: players make their moves with only partial knowledge of the consequences, and the target value is assigned only when the game is finished. Or imagine a cleaning robot moving in a space: it needs to clean the space systematically and occasionally recharge its batteries.

Issues: How does the robot evaluate the quality of its moves? How does it decide where to go?

Reinforcement learning: learn gradually from experience.

Reinforcement learning – overview

Supervised learning: immediate feedback.
Unsupervised learning: no feedback.
Reinforcement learning: delayed feedback.

The model: agents that sense and act upon their environment.

Applications: cleaning robots, investment strategies, game playing, scheduling, verification, ...

A concrete example

Robot grid world: 6 states, arrows are the possible actions. The robot gets a reward for performing actions, so eventually it collects a sequence of rewards r_1, r_2, ... and maximizes the discounted reward

    r_1 + γ·r_2 + γ²·r_3 + ···

where 0 < γ < 1 is a discount factor. The goal is to find an optimal policy which chooses an appropriate action in each state.

Deterministic Markov decision processes

A deterministic Markov decision process (DMDP) consists of
    a set of states S,
    a set of actions A,
    a transition function δ : S × A → S,
    a reward function r : S × A → ℝ.

Semantics: assuming that the current state is s ∈ S, the agent chooses an action a, receives the reward r(s, a), and then moves on to the state δ(s, a).

A policy is a function π : S → A which prescribes how to choose actions in states. Following a given policy π starting in a state s, the agent collects a sequence of rewards r^{π,s} = r^{π,s}_1, r^{π,s}_2, ..., where r^{π,s}_i is the reward collected in the i-th step.

How do we specify the overall quality of a policy? Given a policy π and a state s, denote by V^π(s) the discounted reward

    V^π(s) = r^{π,s}_1 + γ·r^{π,s}_2 + γ²·r^{π,s}_3 + ···

Here 0 < γ < 1 is a discount factor that specifies how the agent "sees" the future. Our goal: find a policy π which belongs to argmax_π V^π(s) for all s ∈ S. For the rest we fix a discount factor 0 < γ < 1.

Example

Consider the policies
    π_1: always go down and right,
    π_2: (on the whiteboard) ...
Let s be the bottom-left state. What is V^{π_1}(s)? What is V^{π_2}(s)? In both cases consider γ = 0.5.

The Value

Define an optimal policy π* by

    π* ∈ argmax_π V^π(s)   for all s ∈ S.

We use V*(s) to denote V^{π*}(s). Exercise: compute V*(s) for all six states if the discount factor is γ = 0.9.

Maximizing the value

The value V* can be expressed using the following recurrent equations:

    V*(s) = max_π V^π(s)
          = max_π ( r^{π,s}_1 + γ·r^{π,s}_2 + γ²·r^{π,s}_3 + ··· )
          = max_a max_{π′} ( r(s, a) + γ·r^{π′,δ(s,a)}_1 + γ²·r^{π′,δ(s,a)}_2 + ··· )
          = max_a ( r(s, a) + γ · max_{π′} ( r^{π′,δ(s,a)}_1 + γ·r^{π′,δ(s,a)}_2 + ··· ) )
          = max_a ( r(s, a) + γ·V*(δ(s, a)) )

Here the third line splits a run into its first action a and the rest of the run, which follows some policy π′ starting from the state δ(s, a).

Value iteration algorithm: initialize V*_0(s) = 0 for every s, and compute V*_{i+1} from V*_i by

    V*_{i+1}(s) = max_a ( r(s, a) + γ·V*_i(δ(s, a)) )
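To make the value iteration algorithm concrete, here is a minimal Python sketch of the deterministic case. It implements exactly the update V*_{i+1}(s) = max_a ( r(s, a) + γ·V*_i(δ(s, a)) ); the two-state toy DMDP it runs on (the states s1 and s2, their actions, and the single reward of 100) is a hypothetical stand-in for the six-state grid world, which is not reproduced here.

    # Value iteration for a deterministic MDP, following the recurrence
    #   V_{i+1}(s) = max_a ( r(s, a) + gamma * V_i(delta(s, a)) ).
    # The two-state DMDP below is a hypothetical toy example, not the grid world.

    GAMMA = 0.9

    delta = {                       # transition function delta : S x A -> S
        ("s1", "stay"): "s1",
        ("s1", "go"):   "s2",
        ("s2", "stay"): "s2",
        ("s2", "go"):   "s1",
    }
    reward = {                      # reward function r : S x A -> R
        ("s1", "stay"): 0.0,
        ("s1", "go"):   100.0,      # the only rewarded transition
        ("s2", "stay"): 0.0,
        ("s2", "go"):   0.0,
    }
    states = {s for (s, _) in delta}

    def actions(s):
        """Actions available in state s, read off from the transition table."""
        return [a for (t, a) in delta if t == s]

    # V_0(s) = 0 for every state s.
    V = {s: 0.0 for s in states}

    # Repeatedly apply the update; 100 sweeps are plenty for gamma = 0.9.
    for _ in range(100):
        V = {s: max(reward[s, a] + GAMMA * V[delta[s, a]] for a in actions(s))
             for s in states}

    # Greedy policy read off from the (approximately) optimal values.
    policy = {s: max(actions(s), key=lambda a: reward[s, a] + GAMMA * V[delta[s, a]])
              for s in states}
    print(V)       # roughly {'s1': 526.3, 's2': 473.7}
    print(policy)  # {'s1': 'go', 's2': 'go'}

The fixed number of sweeps stands in for a proper convergence test; since the update is a γ-contraction, stopping once the largest change drops below a tolerance works just as well.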
Generalization: Markov Decision Processes

Often the transition function δ as well as the rewards r are not deterministic. If we can estimate the probabilities of outcomes, we may use

    δ : S × A → D(S)

where D(S) is the set of all probability distributions on S. For example, δ(s, a)(s′) = 1/2 and δ(s, a)(s″) = 1/2 means that if the agent chooses a in s, then it proceeds randomly either to s′ or to s″. Similarly, r : S × A → D(R), where R is a "reasonable" subset of ℝ.

Now a policy does not determine the sequence of rewards, only a distribution on sequences of rewards. The goal is to maximize the expected discounted reward. The previous recursive equations can be generalized to MDPs.

Q-learning

The recursive equations can be used to compute optimal policies if δ and r are known to the agent. But what if they are not? Assume that the agent only observes
    the current state,
    the current reward,
    the set of available actions.

The idea: try to learn an approximation of the function

    Q(s, a) = r(s, a) + γ·V*(δ(s, a))

Clearly, once we have Q(s, a) = r(s, a) + γ·V*(δ(s, a)), then V*(s) = max_{a′} Q(s, a′), and thus we also have V*. But how do we approximate Q(s, a) when the agent observes only a local state and reward?

Q-learning algorithm

Denote by Q̂ the successive approximations of Q.

Initialize Q̂(s, a) = 0 (i.e. whenever a new state-action pair is encountered, the algorithm assumes that Q̂(s, a) = 0).
Observe the current state s.
Do forever:
    select an action a and execute it,
    receive an immediate reward r,
    observe the new state s′,
    update the value of Q̂(s, a) by

        Q̂(s, a) := r + γ·max_{a′} Q̂(s′, a′),

    set s := s′.

Under some technical conditions, Q̂ converges to Q.

Q-learning example

    Q̂(s1, a_right) = r + γ·max_{a′} Q̂(s2, a′) = 0 + 0.9·max{63, 81, 100} = 90

Exploration vs. exploitation

In the Q-learning algorithm we have not specified how to choose an action in each iteration. Possible approaches:
    1. maximize Q̂(s, a),
    2. give each action an equal opportunity,
    3. choose action a randomly with probability k^{Q̂(s,a)} / Σ_{a′} k^{Q̂(s,a′)}, where k > 1.

Ad 1: exploits the values computed so far and thus improves the policy currently expressed by Q̂.
Ad 2: explores the space while ignoring Q̂.
Ad 3: combines exploration and exploitation: if Q̂(s, a) ≫ max_{a′≠a} Q̂(s, a′), the action a is chosen most of the time, but not always. (A code sketch combining the update rule with this strategy appears at the end of these notes.)

Representation of Q̂

Q̂ can be represented by various means:
    neural networks – Deep Q-learning,
    decision trees,
    SVMs,
    ...
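To tie the pieces together, here is a minimal tabular Q-learning sketch in Python that combines the update rule Q̂(s, a) := r + γ·max_{a′} Q̂(s′, a′) with exploration strategy 3. The environment interface (reset, actions, step), the base k = 2, and the episode and step counts are hypothetical placeholders chosen only for illustration; the slides do not fix them.

    import random
    from collections import defaultdict

    GAMMA = 0.9   # discount factor
    K = 2.0       # exploration base k > 1 from strategy 3

    # Q-hat; unseen (state, action) pairs are implicitly initialized to 0.
    Q = defaultdict(float)

    def choose_action(s, actions):
        """Strategy 3: pick a with probability k^Q(s,a) / sum over a' of k^Q(s,a')."""
        shift = max(Q[s, a] for a in actions)          # shift exponents to avoid overflow
        weights = [K ** (Q[s, a] - shift) for a in actions]
        return random.choices(actions, weights=weights)[0]

    def q_learning(env, episodes=1000, max_steps=100):
        """env is a hypothetical interface: reset() -> s, actions(s) -> list, step(s, a) -> (r, s')."""
        for _ in range(episodes):
            s = env.reset()                            # observe the current state s
            for _ in range(max_steps):
                a = choose_action(s, env.actions(s))   # select an action a and execute it
                r, s_next = env.step(s, a)             # receive reward r, observe the new state s'
                # Q-hat(s, a) := r + gamma * max over a' of Q-hat(s', a')
                Q[s, a] = r + GAMMA * max(Q[s_next, b] for b in env.actions(s_next))
                s = s_next                             # s := s'
        return Q

With k close to 1 the selection rule behaves almost like strategy 2 (near-uniform exploration); the larger k is, the closer it gets to strategy 1 (greedy exploitation).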