Credit assignment problem

Imagine playing a game such as chess: players make their moves with only partial knowledge of the consequences, and the target value is assigned only when the game is finished. Or imagine a cleaning robot moving in a space: it needs to clean the space systematically and occasionally recharge its batteries.

Issues: How does the robot evaluate the quality of its moves? How does it decide where to go?

Reinforcement learning: learn gradually from experience.

Reinforcement learning – overview

Supervised learning: immediate feedback.
Unsupervised learning: no feedback.
Reinforcement learning: delayed feedback.

The model: agents that sense and act upon their environment.

Applications: cleaning robots, investment strategies, game playing, scheduling, verification, ...

A concrete example

Robot grid world: 6 states, arrows are the possible actions. The robot gets a reward for performing actions, so eventually it collects a sequence of rewards r_1, r_2, ... and maximizes the discounted reward

    r_1 + γ·r_2 + γ²·r_3 + ···

where 0 < γ < 1 is a discount factor. The goal is to find an optimal policy which chooses an appropriate action in each state.

Deterministic Markov decision processes

A deterministic Markov decision process (DMDP) consists of
    a set of states S,
    a set of actions A,
    a transition function δ : S × A → S,
    a reward function r : S × A → ℝ.

Semantics: assuming that the current state is s ∈ S, the agent chooses an action a, receives the reward r(s, a), and then moves on to the state δ(s, a).

A policy is a function π : S → A which prescribes how to choose actions in states. Following a given policy π starting in a state s, the agent collects a sequence of rewards r^{π,s} = r^{π,s}_1, r^{π,s}_2, ..., where r^{π,s}_i is the reward collected in the i-th step.

How do we specify the overall quality of a policy? Given a policy π and a state s, denote by V^π(s) the discounted reward

    V^π(s) = r^{π,s}_1 + γ·r^{π,s}_2 + γ²·r^{π,s}_3 + ···

Here 0 < γ < 1 is a discount factor that specifies how the agent "sees" the future. Our goal: find a policy π which belongs to argmax_π V^π(s) for all s ∈ S. For the rest we fix a discount factor 0 < γ < 1.

Example

Consider the policies
    π_1: always go down and right,
    π_2: (on the whiteboard) ...
Let s be the bottom-left state. What is V^{π_1}(s)? What is V^{π_2}(s)? In both cases consider γ = 0.5.

The Value

Define an optimal policy π* by

    π* ∈ argmax_π V^π(s)   for all s ∈ S.

We use V*(s) to denote V^{π*}(s). Exercise: compute V*(s) for all six states if the discount factor is γ = 0.9.

Maximizing the value

The value V* can be expressed using the following recurrent equations:

    V*(s) = max_π V^π(s)
          = max_π ( r^{π,s}_1 + γ·r^{π,s}_2 + γ²·r^{π,s}_3 + ··· )
          = max_a max_{π′} ( r(s, a) + γ·r^{π′,δ(s,a)}_1 + γ²·r^{π′,δ(s,a)}_2 + ··· )
          = max_a ( r(s, a) + γ · max_{π′} ( r^{π′,δ(s,a)}_1 + γ·r^{π′,δ(s,a)}_2 + ··· ) )
          = max_a ( r(s, a) + γ·V*(δ(s, a)) )

Here the third line splits a run into its first action a and the rest of the run, which follows some policy π′ starting from the state δ(s, a).

Value iteration algorithm: initialize V*_0(s) = 0 for every s, and compute V*_{i+1} from V*_i by

    V*_{i+1}(s) = max_a ( r(s, a) + γ·V*_i(δ(s, a)) )
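To make the value iteration algorithm concrete, here is a minimal Python sketch of the deterministic case. It implements exactly the update V*_{i+1}(s) = max_a ( r(s, a) + γ·V*_i(δ(s, a)) ); the two-state toy DMDP it runs on (the states s1 and s2, their actions, and the single reward of 100) is a hypothetical stand-in for the six-state grid world, which is not reproduced here.

    # Value iteration for a deterministic MDP, following the recurrence
    #   V_{i+1}(s) = max_a ( r(s, a) + gamma * V_i(delta(s, a)) ).
    # The two-state DMDP below is a hypothetical toy example, not the grid world.

    GAMMA = 0.9

    delta = {                       # transition function delta : S x A -> S
        ("s1", "stay"): "s1",
        ("s1", "go"):   "s2",
        ("s2", "stay"): "s2",
        ("s2", "go"):   "s1",
    }
    reward = {                      # reward function r : S x A -> R
        ("s1", "stay"): 0.0,
        ("s1", "go"):   100.0,      # the only rewarded transition
        ("s2", "stay"): 0.0,
        ("s2", "go"):   0.0,
    }
    states = {s for (s, _) in delta}

    def actions(s):
        """Actions available in state s, read off from the transition table."""
        return [a for (t, a) in delta if t == s]

    # V_0(s) = 0 for every state s.
    V = {s: 0.0 for s in states}

    # Repeatedly apply the update; 100 sweeps are plenty for gamma = 0.9.
    for _ in range(100):
        V = {s: max(reward[s, a] + GAMMA * V[delta[s, a]] for a in actions(s))
             for s in states}

    # Greedy policy read off from the (approximately) optimal values.
    policy = {s: max(actions(s), key=lambda a: reward[s, a] + GAMMA * V[delta[s, a]])
              for s in states}
    print(V)       # roughly {'s1': 526.3, 's2': 473.7}
    print(policy)  # {'s1': 'go', 's2': 'go'}

The fixed number of sweeps stands in for a proper convergence test; since the update is a γ-contraction, stopping once the largest change drops below a tolerance works just as well.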
Generalization: Markov Decision Processes

Often the transition function δ as well as the rewards r are not deterministic. If we can estimate the probabilities of outcomes, we may use

    δ : S × A → D(S)

where D(S) is the set of all probability distributions on S. For example, δ(s, a)(s′) = 1/2 and δ(s, a)(s″) = 1/2 means that if the agent chooses a in s, then it proceeds randomly either to s′ or to s″. Similarly, r : S × A → D(R), where R is a "reasonable" subset of ℝ.

Now a policy does not determine the sequence of rewards, only a distribution on sequences of rewards. The goal is to maximize the expected discounted reward. The previous recursive equations can be generalized to MDPs.

Q-learning

The recursive equations can be used to compute optimal policies if δ and r are known to the agent. But what if they are not? Assume that the agent only observes
    the current state,
    the current reward,
    the set of available actions.

The idea: try to learn an approximation of the function

    Q(s, a) = r(s, a) + γ·V*(δ(s, a))

Clearly, once we have Q(s, a) = r(s, a) + γ·V*(δ(s, a)), then V*(s) = max_{a′} Q(s, a′), and thus we also have V*. But how do we approximate Q(s, a) when the agent observes only a local state and reward?

Q-learning algorithm

Denote by Q̂ the successive approximations of Q.

Initialize Q̂(s, a) = 0 (i.e. whenever a new state-action pair is encountered, the algorithm assumes that Q̂(s, a) = 0).
Observe the current state s.
Do forever:
    select an action a and execute it,
    receive an immediate reward r,
    observe the new state s′,
    update the value of Q̂(s, a) by

        Q̂(s, a) := r + γ·max_{a′} Q̂(s′, a′),

    set s := s′.

Under some technical conditions, Q̂ converges to Q.

Q-learning example

    Q̂(s1, a_right) = r + γ·max_{a′} Q̂(s2, a′) = 0 + 0.9·max{63, 81, 100} = 90

Exploration vs. exploitation

In the Q-learning algorithm we have not specified how to choose an action in each iteration. Possible approaches:
    1. maximize Q̂(s, a),
    2. give each action an equal opportunity,
    3. choose action a randomly with probability k^{Q̂(s,a)} / Σ_{a′} k^{Q̂(s,a′)}, where k > 1.

Ad 1: exploits the values computed so far and thus improves the policy currently expressed by Q̂.
Ad 2: explores the space while ignoring Q̂.
Ad 3: combines exploration and exploitation: if Q̂(s, a) ≫ max_{a′≠a} Q̂(s, a′), the action a is chosen most of the time, but not always. (A code sketch combining the update rule with this strategy appears at the end of these notes.)

Representation of Q̂

Q̂ can be represented by various means:
    neural networks – Deep Q-learning,
    decision trees,
    SVMs,
    ...
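To tie the pieces together, here is a minimal tabular Q-learning sketch in Python that combines the update rule Q̂(s, a) := r + γ·max_{a′} Q̂(s′, a′) with exploration strategy 3. The environment interface (reset, actions, step), the base k = 2, and the episode and step counts are hypothetical placeholders chosen only for illustration; the slides do not fix them.

    import random
    from collections import defaultdict

    GAMMA = 0.9   # discount factor
    K = 2.0       # exploration base k > 1 from strategy 3

    # Q-hat; unseen (state, action) pairs are implicitly initialized to 0.
    Q = defaultdict(float)

    def choose_action(s, actions):
        """Strategy 3: pick a with probability k^Q(s,a) / sum over a' of k^Q(s,a')."""
        shift = max(Q[s, a] for a in actions)          # shift exponents to avoid overflow
        weights = [K ** (Q[s, a] - shift) for a in actions]
        return random.choices(actions, weights=weights)[0]

    def q_learning(env, episodes=1000, max_steps=100):
        """env is a hypothetical interface: reset() -> s, actions(s) -> list, step(s, a) -> (r, s')."""
        for _ in range(episodes):
            s = env.reset()                            # observe the current state s
            for _ in range(max_steps):
                a = choose_action(s, env.actions(s))   # select an action a and execute it
                r, s_next = env.step(s, a)             # receive reward r, observe the new state s'
                # Q-hat(s, a) := r + gamma * max over a' of Q-hat(s', a')
                Q[s, a] = r + GAMMA * max(Q[s_next, b] for b in env.actions(s_next))
                s = s_next                             # s := s'
        return Q

With k close to 1 the selection rule behaves almost like strategy 2 (near-uniform exploration); the larger k is, the closer it gets to strategy 1 (greedy exploitation).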