6.2 Multilevel Methods for Reinforcement Learning
6.2.3 Model-Free Case: TD with Additive Preconditioner
We would now like to deploy the multilevel approach to accelerate the TD(λ) method. To this end, we invoke the concept of a so-called preconditioner, which we present in the following.
We consider the inter-level operators $I_l^{l+1}$ and $I_{l+1}^l$ as described in Sect. 6.2.1.
Moreover, let $I_l^m$ (for $m \le l$) be the prolongator from level $l$ to level $m$, defined as
$$I_l^m = I_{m+1}^m \, I_{m+2}^{m+1} \cdots I_l^{l-1}, \qquad (6.16)$$
and $I_l^l$ is stipulated to be the identity matrix. Let a preconditioner $C_t^{-1}$ be given by
$$C_t^{-1} = \sum_{l=0}^{L} \beta_{l,t}\, I_l^0\, I_0^l. \qquad (6.17)$$
Its application is summarized in Algorithm 6.3.
Algorithm 6.3: BPX preconditioner (Ziv)
Input: residual $y$, number of grids $L$, interpolators $I_{l+1}^l$, restrictors $I_l^{l+1}$, coefficients $\beta_l$, $l = 0, \ldots, L$
Output: new guess $x \in \mathbb{R}^n$
1: procedure BPX($y$)
2:   $x_0 := y$
3:   for $l = 1, \ldots, L$ do
4:     $x_l := I_{l-1}^l x_{l-1}$  ⊲ restriction
5:   end for
6:   for $l = 0, \ldots, L$ do
7:     $x_l := \beta_l x_l$  ⊲ scaling
8:   end for
9:   for $l = L, \ldots, 1$ do
10:    $x_{l-1} := x_{l-1} + I_l^{l-1} x_l$  ⊲ interpolation
11:  end for
12:  return $x_0$
13: end procedure
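To make the data flow of Algorithm 6.3 concrete, the following sketch applies the preconditioner with explicit restriction and interpolation matrices; the accumulation in the upward sweep corresponds to summing the level contributions in (6.17). It is only an illustration: the function name, the list-based representation of the inter-level operators, and the final consistency check are assumptions, not part of the original algorithm.

```python
import numpy as np

def bpx_apply(y, restrictors, interpolators, betas):
    """Apply the BPX-type preconditioner C_t^{-1} of (6.17) to a residual y.

    restrictors[l]   : matrix I_l^{l+1}, mapping level l to level l+1 (fine to coarse)
    interpolators[l] : matrix I_{l+1}^l, mapping level l+1 to level l (coarse to fine)
    betas[l]         : scaling coefficient beta_l for level l = 0, ..., L
    """
    L = len(restrictors)                     # number of coarse levels
    x = [None] * (L + 1)
    x[0] = np.asarray(y, dtype=float).copy()
    for l in range(1, L + 1):                # downward sweep: restriction
        x[l] = restrictors[l - 1] @ x[l - 1]
    for l in range(L + 1):                   # scaling on every level
        x[l] = betas[l] * x[l]
    for l in range(L, 0, -1):                # upward sweep: interpolation, accumulating
        x[l - 1] = x[l - 1] + interpolators[l - 1] @ x[l]
    return x[0]

# Consistency check against (6.17) for a random two-level hierarchy (illustrative).
rng = np.random.default_rng(0)
P = rng.random((6, 2))                       # interpolator I_1^0 (assumed)
R = 0.5 * P.T                                # restrictor I_0^1 (assumed choice)
y = rng.random(6)
direct = y + P @ (R @ y)                     # beta_0 I + beta_1 I_1^0 I_0^1 applied to y
assert np.allclose(bpx_apply(y, [R], [P], [1.0, 1.0]), direct)
```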
The preconditioner of Ziv (6.17) can be viewed as an algebraic counterpart to the BPX preconditioner [BPX90] well known in numerical analysis.
Then the preconditioned TD(λ) method (for simplicity we omit the iteration indices)
$$w := w + \alpha_t\, C_t^{-1} z_t d_t \qquad (6.18)$$
converges almost surely to the same solution as the TD(λ) method (3.20).
The proof of convergence is essentially based on the following theorem from [Ziv04], a further generalization of which has been devised in [Pap10].
Theorem 6.2 Let the prerequisites for convergence of TD(λ) of Theorem 3.2 be satisfied. Moreover, let $B^{-1}$ be a symmetric and positive definite (spd) $N \times N$ matrix. Then the preconditioned TD(λ) method
$$w := w + \alpha_t\, B^{-1} z_t d_t \qquad (6.19)$$
converges as well.
Since $C_t^{-1}$ is an spd $N \times N$ matrix, the hierarchically preconditioned TD(λ) converges.
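As a complement, here is a minimal sketch of a single preconditioned TD(λ) step in the sense of (6.18), assuming a linear value-function model; the feature vectors, parameter names, and default constants are illustrative assumptions only.

```python
import numpy as np

def preconditioned_td_lambda_step(w, z, phi, phi_next, r, C_inv,
                                  alpha=0.1, gamma=0.9, lam=0.8):
    """One preconditioned TD(lambda) step in the sense of (6.18),
    for a linear value function V(s) = phi(s)^T w.

    w, z      : weight vector and eligibility trace
    phi       : feature vector of the current state(-action pair)
    phi_next  : feature vector of the successor
    r         : observed reward
    C_inv     : spd preconditioner matrix, e.g. the BPX-type C_t^{-1} of (6.17)
    """
    z = gamma * lam * z + phi                     # eligibility trace update
    d = r + gamma * (phi_next @ w) - phi @ w      # temporal-difference error d_t
    w = w + alpha * (C_inv @ z) * d               # preconditioned update (6.18)
    return w, z
```

With `C_inv` equal to the identity matrix, the step reduces to the ordinary TD(λ) update (3.20).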
The preconditioned TD(λ) algorithm (6.18), however, operates in terms of action values. Hence, the inter-level operators are needed for state-action pairs (s, a) rather than for single states. Yet how can we define hierarchies of actions? Since in the recommendation approach the spaces $S$ and $A$ are isomorphic (4.1), actions may be treated in the same way as states, and the same inter-level operators may be used for them.
While the states in $S$ correspond to products that are endowed with recommendations, the actions in $A$ correspond to the recommended products. Thus, similarly to (6.11), the following definition of the prolongator $\hat I_{l+1}^l$ suggests itself:
$$\left(\hat I_{l+1}^l\right)_{(i,j),(\beta,\gamma)} = \begin{cases} 1, & i \in G_\beta \wedge j \in H_\gamma(i), \\ 0, & \text{else.} \end{cases} \qquad (6.20)$$
Here, the aggregations $H_\gamma(i)$, $\gamma = 1, \ldots, m_i$, refer to $A(s_i)$, that is, to all actions executable in state $s_i$. Thus, the prolongation matrix is a block-diagonal matrix, where the blocks correspond to states and the block values to the actions.
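A possible assembly of the prolongator (6.20) is sketched below. The encoding of the state groups $G_\beta$ and the state-dependent action groups $H_\gamma(i)$ as Python mappings, as well as the flattened index ordering, are assumptions made purely for illustration.

```python
import numpy as np

def build_prolongator(state_group, action_group, n_states,
                      n_state_groups, n_action_groups):
    """Assemble the prolongator of (6.20) for state-action pairs.

    state_group[i]       : index beta of the group G_beta containing state i
    action_group[(i, j)] : index gamma of the group H_gamma(i) containing action j
                           in state i; pairs without an executable action are absent
    The fine pair (i, j) is flattened as i * n_states + j, the coarse pair
    (beta, gamma) as beta * n_action_groups + gamma (actions indexed like states).
    """
    P = np.zeros((n_states * n_states, n_state_groups * n_action_groups))
    for (i, j), gamma in action_group.items():
        beta = state_group[i]
        P[i * n_states + j, beta * n_action_groups + gamma] = 1.0
    return P
```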
Subsequently, we shall address a modification of the prolongator $\hat I_{l+1}^l$. This weighted prolongator is defined as follows:
$$\left(\hat I_{l+1}^l\right)_{(i,j),(\beta,\gamma)} = \begin{cases} |A(s_i)|\,|A(a_j)|, & i \in G_\beta \wedge j \in H_\gamma(i), \\ 0, & \text{else.} \end{cases} \qquad (6.21)$$
Here, $|A(s_i)|$ denotes the number of all actions in state $s_i$, that is, all rules for the corresponding product, and $|A(a_j)|$ the number of actions in the state associated with $a_j$, that is, the rules with the associated product as conclusion. This weighted prolongator thus prefers rules with "strong" prerequisite or subsequent products, respectively.
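Under the same illustrative encoding as before, the weighted prolongator (6.21) differs only in the entry value; passing the two rule counts as separate arrays is an assumption about the data layout rather than part of the original text.

```python
import numpy as np

def build_weighted_prolongator(state_group, action_group,
                               n_rules_state, n_rules_action,
                               n_states, n_state_groups, n_action_groups):
    """Assemble the weighted prolongator of (6.21).

    n_rules_state[i]  : |A(s_i)|, the number of rules with product i as premise
    n_rules_action[j] : |A(a_j)|, the number of rules with product j as conclusion
    """
    P = np.zeros((n_states * n_states, n_state_groups * n_action_groups))
    for (i, j), gamma in action_group.items():
        beta = state_group[i]
        # weight |A(s_i)| * |A(a_j)|: large for rules whose prerequisite and
        # subsequent products are "strong" (carry many rules)
        P[i * n_states + j, beta * n_action_groups + gamma] = (
            n_rules_state[i] * n_rules_action[j]
        )
    return P
```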
In general, one can derive multiple hierarchies from the product specifications, for example, by means of shop hierarchies, commodity groups, and product attributes. Consequently, a corresponding preconditioner $\hat C_i^{-1}$ can be derived for each hierarchy according to (6.17). This gives rise to the question of whether preconditioners can also be applied in a combined fashion. Indeed, this is possible, for example, with respect to the preconditioner $\hat C_a^{-1}$:
$$\hat C_a^{-1} = \hat C_1^{-1} + \hat C_2^{-1} + \cdots + \hat C_n^{-1}, \qquad (6.22)$$
where $n$ denotes the number of all hierarchies used. Since all of the preconditioners $\hat C_i^{-1}$ are spd, so is $\hat C_a^{-1}$, and convergence of the preconditioned TD(λ) follows from Theorem 6.2.
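In practice, the combined preconditioner (6.22) need never be formed as a matrix; it suffices to sum the actions of the individual hierarchy preconditioners on the residual, for example as in the following sketch (the callable-based interface is an assumption):

```python
import numpy as np

def combined_preconditioner_apply(y, hierarchy_preconditioners):
    """Apply C_a^{-1} = C_1^{-1} + ... + C_n^{-1} from (6.22) to a residual y.

    hierarchy_preconditioners : list of callables, each applying one hierarchy's
                                C_i^{-1} to a vector (e.g. the BPX routine above)
    """
    result = np.zeros_like(y, dtype=float)
    for apply_C_inv in hierarchy_preconditioners:
        result += apply_C_inv(y)
    return result
```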
Example 6.4 We consider the state space of Example 6.1 with the corresponding interpolator $I_1^0$ and restrictor $I_0^1$. This gives the following preconditioner for the state-value function:
$$C^{-1}x = \begin{bmatrix} \tilde x_1 \\ \tilde x_2 \\ \tilde x_3 \end{bmatrix} = \begin{bmatrix} 1+\tfrac{1}{2} & \tfrac{1}{2} & 0 \\ \tfrac{1}{2} & 1+\tfrac{1}{2} & 0 \\ 0 & 0 & 1+\tfrac{1}{2} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}.$$
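The matrix above can be reproduced numerically. The concrete interpolator and restrictor used below (groups {s1, s2} and {s3}, restriction weight 1/2 per aggregated state, all $\beta_{l,t} = 1$) are assumptions consistent with Example 6.1 as used here and with the entries shown:

```python
import numpy as np

# Interpolator I_1^0: states s1 and s2 form the first group, s3 the second.
P = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])
# Restrictor I_0^1 with weight 1/2 per aggregated state (assumed, matching the
# entries of the matrix above).
R = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.0, 0.5]])

# Preconditioner (6.17) with beta_0 = beta_1 = 1:  C^{-1} = I + I_1^0 I_0^1
C_inv = np.eye(3) + P @ R
print(C_inv)
# [[1.5 0.5 0. ]
#  [0.5 1.5 0. ]
#  [0.  0.  1.5]]
```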
The example of the state aggregations should now be extended to include the associated actions, where initially all states are permissible as actions in all states. The result is shown in Fig. 6.6. For ease of reading, the actions are denoted by $x$ at level 0 and by $y$ at level 1, where the lower index represents the start node and the upper index the target node. For instance, $x_1^2$ is the recommendation of product 2 for product 1 at level 0.
Note that on the finest grid, that is, the product level, the reflexive relation $x_i^i$ is practically meaningless, since a product cannot recommend itself. At levels $>0$ these actions are meaningful, however, since they are a measure of the strength of product recommendations within the same group relative to one another.
Fig. 6.6 Interpolation operator for state-action aggregations
The following interpolation and restriction matrices apply to the example under consideration:
$$\hat I_1^0\, \hat y = \begin{bmatrix} x_1^1 \\ x_1^2 \\ x_1^3 \\ x_2^1 \\ x_2^2 \\ x_2^3 \\ x_3^1 \\ x_3^2 \\ x_3^3 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} y_1^1 \\ y_1^2 \\ y_2^1 \\ y_2^2 \end{bmatrix},$$
$$\hat I_0^1\, \hat x = \begin{bmatrix} y_1^1 \\ y_1^2 \\ y_2^1 \\ y_2^2 \end{bmatrix} = \begin{bmatrix} \tfrac{1}{4} & \tfrac{1}{4} & 0 & \tfrac{1}{4} & \tfrac{1}{4} & 0 & 0 & 0 & 0 \\ 0 & 0 & \tfrac{1}{2} & 0 & 0 & \tfrac{1}{2} & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & \tfrac{1}{2} & \tfrac{1}{2} & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \tfrac{1}{2} \end{bmatrix} \begin{bmatrix} x_1^1 \\ x_1^2 \\ x_1^3 \\ x_2^1 \\ x_2^2 \\ x_2^3 \\ x_3^1 \\ x_3^2 \\ x_3^3 \end{bmatrix},$$
where the preconditioner $\hat C^{-1}$ is derived as follows:
$$\hat C^{-1}\hat x = \begin{bmatrix} \tilde x_1^1 \\ \tilde x_1^2 \\ \tilde x_1^3 \\ \tilde x_2^1 \\ \tilde x_2^2 \\ \tilde x_2^3 \\ \tilde x_3^1 \\ \tilde x_3^2 \\ \tilde x_3^3 \end{bmatrix} = \begin{bmatrix} 1+\tfrac{1}{4} & \tfrac{1}{4} & 0 & \tfrac{1}{4} & \tfrac{1}{4} & 0 & 0 & 0 & 0 \\ \tfrac{1}{4} & 1+\tfrac{1}{4} & 0 & \tfrac{1}{4} & \tfrac{1}{4} & 0 & 0 & 0 & 0 \\ 0 & 0 & 1+\tfrac{1}{2} & 0 & 0 & \tfrac{1}{2} & 0 & 0 & 0 \\ \tfrac{1}{4} & \tfrac{1}{4} & 0 & 1+\tfrac{1}{4} & \tfrac{1}{4} & 0 & 0 & 0 & 0 \\ \tfrac{1}{4} & \tfrac{1}{4} & 0 & \tfrac{1}{4} & 1+\tfrac{1}{4} & 0 & 0 & 0 & 0 \\ 0 & 0 & \tfrac{1}{2} & 0 & 0 & 1+\tfrac{1}{2} & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1+\tfrac{1}{2} & \tfrac{1}{2} & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & \tfrac{1}{2} & 1+\tfrac{1}{2} & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1+\tfrac{1}{2} \end{bmatrix} \begin{bmatrix} x_1^1 \\ x_1^2 \\ x_1^3 \\ x_2^1 \\ x_2^2 \\ x_2^3 \\ x_3^1 \\ x_3^2 \\ x_3^3 \end{bmatrix}.$$
For simplicity's sake, all scaling factors $\beta_{l,t}$ were set to a constant 1. We should now take a brief look at how the preconditioner operates. An update via the action $x_1^2$ leads to an update of the action $x_1^2$ itself and also of $x_2^1$, in accordance with
$$\tilde x_1^2 = \left(1+\tfrac{1}{4}\right) x_1^2, \qquad \tilde x_2^1 = \tfrac{1}{4}\, x_1^2.$$
This results from the reflexive coarse grid action $y_1^1$ of the group $y_1$ on itself. Of course, the reflexive update for $x_1^2$ is especially strong here.
An update via the action $x_1^3$ leads to an update of the action $x_1^3$ itself and also of $x_2^3$:
$$\tilde x_1^3 = \left(1+\tfrac{1}{2}\right) x_1^3, \qquad \tilde x_2^3 = \tfrac{1}{2}\, x_1^3.$$
It is the coarse grid action $y_1^2$ of the group $y_1$ on $y_2$ that is responsible for this. ■
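Both the block matrix $\hat C^{-1}$ and the two update factors just discussed can be verified with a few lines of code; the script below merely re-enters the interpolation and restriction matrices of this example (all $\beta_{l,t} = 1$) and applies the resulting preconditioner to unit impulses:

```python
import numpy as np

# Fine ordering (x_1^1, x_1^2, x_1^3, x_2^1, x_2^2, x_2^3, x_3^1, x_3^2, x_3^3),
# coarse ordering (y_1^1, y_1^2, y_2^1, y_2^2), as in the example above.
P = np.array([[1, 0, 0, 0],   # x_1^1 <- y_1^1
              [1, 0, 0, 0],   # x_1^2 <- y_1^1
              [0, 1, 0, 0],   # x_1^3 <- y_1^2
              [1, 0, 0, 0],   # x_2^1 <- y_1^1
              [1, 0, 0, 0],   # x_2^2 <- y_1^1
              [0, 1, 0, 0],   # x_2^3 <- y_1^2
              [0, 0, 1, 0],   # x_3^1 <- y_2^1
              [0, 0, 1, 0],   # x_3^2 <- y_2^1
              [0, 0, 0, 1]],  # x_3^3 <- y_2^2
             dtype=float)
R = np.array([[0.25, 0.25, 0.0, 0.25, 0.25, 0.0, 0.0, 0.0, 0.0],
              [0.0,  0.0,  0.5, 0.0,  0.0,  0.5, 0.0, 0.0, 0.0],
              [0.0,  0.0,  0.0, 0.0,  0.0,  0.0, 0.5, 0.5, 0.0],
              [0.0,  0.0,  0.0, 0.0,  0.0,  0.0, 0.0, 0.0, 0.5]])

C_inv = np.eye(9) + P @ R                 # all beta_{l,t} = 1

# Update concentrated on x_1^2 (index 1): diagonal factor 1 + 1/4, entry 1/4 for x_2^1.
e = np.zeros(9); e[1] = 1.0
print(C_inv @ e)

# Update concentrated on x_1^3 (index 2): factor 1 + 1/2 for x_1^3 and 1/2 for x_2^3.
e = np.zeros(9); e[2] = 1.0
print(C_inv @ e)
```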
Figure 6.7 illustrates the general logic of the updates using the example of the action $(s_0, a_0)$.
An update of the rule $(s_0, a_0)$ therefore leads not only to the update of the rule itself but also to the update of all rules leading from the same state group $G$ of the initial product $s_0$ into the same action group $H$ of the recommended product $a_0$.
From a technical point of view, there is another positive aspect: when the preconditioner $\hat C^{-1}$ updates an action value for a state-action pair $(s, a)$ for which no rule exists yet, that rule can be generated automatically. In this way, the hierarchical preconditioner automatically generates new recommendations for products without recommendations (due to missing or insufficient transaction history).
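One conceivable realization of this mechanism is sketched below, assuming the rule base is kept as a dictionary from state-action pairs to action values; the function, its arguments, and the threshold are illustrative assumptions:

```python
def merge_preconditioned_update(rules, update, threshold=0.0):
    """Merge a preconditioned update into the rule base.

    rules  : dict mapping (s, a) to the current action value of the rule
    update : dict mapping (s, a) to the increment alpha * (C^{-1} z d) for that pair
    A pair (s, a) that receives a value above the threshold but has no rule yet
    is added, i.e. a new recommendation is generated automatically.
    """
    for (s, a), delta in update.items():
        if (s, a) in rules:
            rules[(s, a)] += delta
        elif delta > threshold:          # no rule yet: create it from the hierarchy
            rules[(s, a)] = delta
    return rules
```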
We will pursue this subject further in the next section.