
6.2 Multilevel Methods for Reinforcement Learning

6.2.2 The Model-Based Case: AMG

In the following, we shall investigate and discuss the algebraic multigrid method for solving the Bellman equation (3.15) or (3.17), respectively. The approach was proposed by Bertsekas and Castanon in [BC89]. Without loss of generality, we shall restrict ourselves to the 2-level method, that is, l = 0 is the fine grid and l = 1 the coarse grid.

Let L = I₀¹ be the aggregation prolongator according to (6.11). Moreover, let R be the restriction operator given by

R := (L^T W L)^{-1} L^T W,  W := diag(w),  (6.14)

(of which (6.12) is the special case where W = I), where w ∈ R^n is (component-wise) positive and sums to 1. The matrix R is also referred to as the Moore-Penrose inverse of L with respect to the w-weighted inner product.
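To make these aggregation operators concrete, the following minimal NumPy sketch builds the prolongator L from a given partition and the w-weighted restrictor R according to (6.14); the example partition, the weight vector, and the function names are illustrative assumptions rather than anything prescribed by the text.

    import numpy as np

    def aggregation_prolongator(partition, n):
        # column beta of L is the indicator vector of class G_beta
        L = np.zeros((n, len(partition)))
        for beta, G in enumerate(partition):
            L[list(G), beta] = 1.0
        return L

    def weighted_restrictor(L, w):
        # R = (L^T W L)^{-1} L^T W with W = diag(w); L^T W L is diagonal, so the solve is cheap
        W = np.diag(w)
        return np.linalg.solve(L.T @ W @ L, L.T @ W)

    partition = [{0, 1}, {2, 3, 4, 5}, {6, 7, 8}]     # hypothetical classes G_1, G_2, G_3
    n, w = 9, np.full(9, 1.0 / 9)                     # w is component-wise positive and sums to 1
    L = aggregation_prolongator(partition, n)
    R = weighted_restrictor(L, w)
    assert np.allclose(R @ L, np.eye(len(partition))) # R is a left inverse of L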

We now consider the 2-level multigrid procedure according to Algorithm 6.1 with a single smoothing step, that is, ν_1 = ν_2 = 1, by means of the Richardson iteration

x := IT(x, y) := x + y − Ax.

As we carry out only one smoothing step and y is taken to be the residual Ax − ŷ, the smoothing simplifies to

x := x + (Ax − ŷ) − Ax = x − ŷ.

Of course, this no longer holds for multiple smoothing steps. For two smoothing steps, we would obtain

x := (x − ŷ) + (Ax − ŷ) − A(x − ŷ) = x − 2ŷ + Aŷ.
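This simplification is easy to check numerically; the following short sketch (NumPy, with arbitrary test data) fixes y at the residual Ax − ŷ and verifies both the one-step and the two-step formula.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 6
    A = np.eye(n) - 0.9 * rng.random((n, n)) / n   # arbitrary test matrix
    x = rng.random(n)
    y_hat = rng.random(n)
    y = A @ x - y_hat                              # y is taken to be the residual A x - y_hat

    step = lambda z: z + y - A @ z                 # one Richardson sweep with this fixed y
    x1 = step(x)
    x2 = step(x1)

    assert np.allclose(x1, x - y_hat)                    # one step:  x - y_hat
    assert np.allclose(x2, x - 2 * y_hat + A @ y_hat)    # two steps: x - 2*y_hat + A*y_hat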

The resulting multigrid method is summarized in Algorithm 6.2. Here, the upper index 1 has been omitted for the sake of readability.

Algorithm 6.2: Multigrid V-cycle

Input: matrix A, right-hand side b, prolongator L, restrictor R, initial iterate x := x0 ∈ R^n
Output: approximate solution x̃ ∈ R^n of (6.4)

1: procedure VCYCLE(y)
2:   x := x − y              ⊲ pre-smoothing
3:   y1 := R(Ax − y)         ⊲ computing the residual
4:   x1 := (A1)^{-1} y1      ⊲ direct solver on the coarse grid
5:   x := x + L x1           ⊲ coarse grid correction
6:   x := x − y              ⊲ post-smoothing
7:   return x
8: end procedure
9: return VCYCLE(b)          ⊲ initial call
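A direct transcription of Algorithm 6.2 into NumPy might look as follows; the choice of the coarse-grid matrix as the product A1 := RAL and the iterative driver with its residual-based stopping rule are assumptions made for the sake of a self-contained sketch, since the algorithm above only specifies a single V-cycle.

    import numpy as np

    def vcycle(A, y, x, L, R):
        # one V-cycle of Algorithm 6.2 with a single Richardson pre-/post-smoothing step
        A1 = R @ A @ L                   # coarse-grid matrix (assumed to be the product R A L)
        x = x - y                        # pre-smoothing (simplified Richardson step, see above)
        y1 = R @ (A @ x - y)             # restricted residual
        x1 = np.linalg.solve(A1, y1)     # direct solver on the coarse grid
        x = x + L @ x1                   # coarse grid correction
        x = x - y                        # post-smoothing
        return x

    def amg_solve(A, b, L, R, x0, tol=1e-10, max_cycles=200):
        # illustrative driver: repeat V-cycles until the residual of A x = b is small
        x = x0.copy()
        for _ in range(max_cycles):
            x = vcycle(A, b, x, L, R)
            if np.linalg.norm(A @ x - b, np.inf) < tol:
                break
        return x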

The following rather technical convergence result has been established in [Pap10].

Theorem 6.1 (Theorem 3.7.1 in [Pap11]) Let

P̃ ∈ R^{n×n},  p̃_ij := α q_β(j) for i, j ∈ G_β,  p̃_ij := 0 else,  (6.15)

for 0 ≤ α ≤ 1, and

q_β(j) ≥ 0 for j ∈ G_β,  Σ_{j ∈ G_β} q_β(j) = 1,  β = 1, …, m.

Moreover, we define

A := I − γP,  Q := I − A,  K := I − L(RAL)^{-1}RA,

ε := ‖P − P̃‖_∞,  ε̂ := ‖αI − RPL‖_∞,  and  c := min_{β, j} q_β(j).

Let ε̂ < γ^{-1}(1 − αγ). Then the asymptotic convergence rate of Algorithm 6.2 satisfies

ρ(KQ) ≤ ‖KQ‖_∞ ≤ (2(1 − αγc)γ + γ²)/(1 − αγ) · ε + δ,  where  δ := γ²‖A‖_∞ ε̂ / ((1 − γ(α + ε̂))(1 − αγ)).

If the stronger condition ε < γ^{-1}(1 − αγ) holds, we obtain

δ ≤ γ²‖A‖_∞ ε / ((1 − γ(α + ε))(1 − αγ)).

Proof Let

Ã := I − γP̃,  K̃ := I − L(RÃL)^{-1}RÃ,  Q̃ := I − Ã.

In the following, we shall establish an expression for KQ in terms of K̃ and Q̃. To this end, notice that, with δA := A − Ã and some B ∈ R^{m×m}, we have

K = I − L((RÃL)^{-1} + B)RA
  = I − L(RÃL)^{-1}RA − LBR A  (setting B̲ := LBR)
  = I − L(RÃL)^{-1}RÃ − L(RÃL)^{-1}RδA − B̲A
  = K̃ − (1/(1 − αγ)) LRδA − B̲A  =: K̃ + δK.

Moreover, we have

KQ = K̃Q̃ + K̃δQ + δKQ,

where δQ := Q − Q̃ = −δA, and the identity K̃Q̃ = O follows from the fact that, as an immediate consequence of the definition of P̃, K̃ is an oblique projector along the range of P̃ (cf. Lemma 3.2.7 in [Pap11]). Combining the above expressions, we obtain

KQ = K̃δQ + (1/(1 − αγ)) LR δQ Q − B̲AQ  =: E(1) + E(2) + E(3).

The remainder of the proof is by bounding the additive terms on the right-hand side of the above equation separately:

‖E(1)‖_∞ ≤ ‖I − L(RÃL)^{-1}RÃ‖_∞ ‖δQ‖_∞ ≤ (1 + ‖L‖_∞ ‖(RÃL)^{-1}‖_∞ ‖R‖_∞ ‖Ã‖_∞) γε = (1 + ‖Ã‖_∞/(1 − αγ)) γε.

Moreover, row stochasticity of P̃ implies that (cf. Eq. (3.2.4) in [Pap11])

‖I − γP̃‖_∞ = max_i ( |1 − γp̃_ii| + γ Σ_{j ≠ i} p̃_ij ) = max_i ( |1 − γp̃_ii| + γ(1 − p̃_ii) ) = 1 + γ(1 − 2 min_i p̃_ii),

which, by definition of Ã, gives rise to

‖Ã‖_∞ = max_{k = 1, …, m} ‖I_{n_k} − αγ 𝟙 q_k^T‖_∞ = 1 + (1 − 2c)αγ  (with 𝟙 denoting the all-ones vector),

and hence

‖E(1)‖_∞ ≤ (2(1 − αγc)/(1 − αγ)) γε.

As for E(2), we obtain

‖E(2)‖_∞ ≤ (‖LR‖_∞/(1 − αγ)) ‖δQ‖_∞ ‖Q‖_∞ = (1/(1 − αγ)) γε γ = γ²ε/(1 − αγ).

Finally, we bound ‖E(3)‖_∞ as follows: the assumption ε̂ < γ^{-1}(1 − αγ) yields

‖RδAL‖_∞ = γ‖αI − RPL‖_∞ = γε̂ < 1 − αγ = 1/‖(RÃL)^{-1}‖_∞.

Hence, RAL and RÃL satisfy the hypothesis of Theorem 2.7.2 in [GVL96, pp. 80–86], and we may apply the perturbation bound provided therein to establish


‖B‖_∞ ≤ ( κ_∞(RÃL) ‖RδAL‖_∞/‖RÃL‖_∞ ) / ( 1 − κ_∞(RÃL) ‖RδAL‖_∞/‖RÃL‖_∞ ) · ‖(RÃL)^{-1}‖_∞
      = ‖RδAL‖_∞ / (‖RÃL‖_∞ − ‖RδAL‖_∞) · 1/(1 − αγ)
      = ‖RδAL‖_∞ / (1 − αγ − ‖RδAL‖_∞) · 1/(1 − αγ)
      ≤ γε̂ / ((1 − αγ − γε̂)(1 − αγ)),

where we invoked the identity

κ_∞(RÃL) = κ_∞((1 − αγ)I) = 1.

Hence, we obtain

‖E(3)‖_∞ ≤ ‖LBR‖_∞ ‖A‖_∞ ‖Q‖_∞ ≤ γ²ε̂ / ((1 − (α + ε̂)γ)(1 − αγ)) · ‖A‖_∞.

If the stronger condition ε < γ^{-1}(1 − αγ) holds, we obtain the bound

‖E(3)‖_∞ ≤ γ²ε / ((1 − (α + ε)γ)(1 − αγ)) · ‖A‖_∞

from essentially the same calculation. □

The message of this result is as follows: the closer the transition probability matrix is to a block-diagonal matrix composed of rank-1 blocks corresponding to the classes of the partition, the faster the AMG procedure converges. With regard to recommendation engines, we may conclude that the method is sensibly applicable if the major part of the transitions takes place between products of the same class. Furthermore, the behavior within the classes should be almost memoryless, that is, the transition to a product within a class should hardly depend on the previous product.
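As a rough quantitative illustration, the bound of Theorem 6.1 can be evaluated numerically, here in the variant under the stronger condition and with made-up parameter values γ = 0.7, α = 0.9, c = 0.1 and ‖A‖_∞ ≤ 1 + γ; none of these numbers come from the text.

    gamma, alpha, c = 0.7, 0.9, 0.1               # hypothetical parameters
    norm_A = 1 + gamma                            # since A = I - gamma*P with P row stochastic

    def bound(eps):
        assert eps < (1 - alpha * gamma) / gamma  # stronger condition of Theorem 6.1
        first = (2 * (1 - alpha * gamma * c) * gamma + gamma**2) / (1 - alpha * gamma) * eps
        second = gamma**2 * norm_A * eps / ((1 - gamma * (alpha + eps)) * (1 - alpha * gamma))
        return first + second

    for eps in (0.01, 0.05, 0.1):
        print(f"eps = {eps}: rho(KQ) <= {bound(eps):.2f}")

For these values the bound stays below 1 only for rather small ε, which mirrors the qualitative statement above: the approach pays off when P is genuinely close to the block-rank-1 structure.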

Example 6.3 For a shop with 9 products (n = 9) and a partition into 3 classes (m = 3) containing 2, 4, and 3 products, respectively, the matrix P̃ has the following structure:

P̃ = [ a b 0 0 0 0 0 0 0
      a b 0 0 0 0 0 0 0
      0 0 c d e f 0 0 0
      0 0 c d e f 0 0 0
      0 0 c d e f 0 0 0
      0 0 c d e f 0 0 0
      0 0 0 0 0 0 g h i
      0 0 0 0 0 0 g h i
      0 0 0 0 0 0 g h i ],

where the rows within each class coincide, in accordance with (6.15). ■
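A matrix of this form is easily generated programmatically; the following sketch builds P̃ according to (6.15) for the class sizes of Example 6.3, with a made-up α and made-up within-class distributions q_β, and checks that every diagonal block has rank 1.

    import numpy as np

    alpha = 0.8                                        # assumed intra-class transition mass
    classes = [[0, 1], [2, 3, 4, 5], [6, 7, 8]]        # classes of sizes 2, 4, 3
    q = [np.array([0.4, 0.6]),                         # made-up within-class distributions q_beta
         np.array([0.1, 0.2, 0.3, 0.4]),
         np.array([0.3, 0.3, 0.4])]

    P_tilde = np.zeros((9, 9))
    for G, q_beta in zip(classes, q):
        for i in G:
            P_tilde[i, G] = alpha * q_beta             # p_ij = alpha * q_beta(j) for i, j in G_beta

    for G in classes:
        assert np.linalg.matrix_rank(P_tilde[np.ix_(G, G)]) == 1   # rank-1 diagonal blocks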

Obviously, the first assumption is realistic, since, in a typical shop, most transitions take place within product categories, provided that the latter have been chosen sensibly. The second assumption, however, is violated in most practical situations.

An alternative approach with favorable convergence properties in more general situations is iterative aggregation-disaggregation (IAD). This method is obtained by replacing the Richardson sweep by a so-called additive algebraic Schwarz sweep. Having its origins in the field of partial differential equations, the latter is a domain decomposition-based procedure which can also be applied to systems of linear equations of the form (6.4). In this context, one speaks of algebraic Schwarz methods. We shall now provide a brief outline of the algebraic Schwarz sweep (details may be found in [Pap10]).

For each set in the partition, we restrict the residual and the corresponding matrix coefficients to the indices therein and add the solution of the thus obtained system to the corresponding entries of the current iterate. It may easily be verified that the algebraic Schwarz sweep yields the exact solution in case of a block- diagonal system. Due to continuity, we may conclude that – if applied in an iterative fashion – the method exhibits swift convergence if the system is almost block-diagonal.
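The verbal description translates into a few lines of NumPy; the function signature and the dense per-class solves below are illustrative assumptions, and the details of the actual procedure can be found in [Pap10].

    import numpy as np

    def schwarz_sweep(A, b, x, partition):
        # one additive algebraic Schwarz sweep over the sets of the partition
        r = b - A @ x                                   # global residual at the current iterate
        x_new = x.copy()
        for G in partition:
            idx = sorted(G)
            A_G = A[np.ix_(idx, idx)]                   # matrix coefficients restricted to the class
            x_new[idx] += np.linalg.solve(A_G, r[idx])  # add the local solution to the iterate
        return x_new

For a block-diagonal matrix A with blocks given by the partition, a single such sweep already returns the exact solution, which is the exactness property that the continuity argument above builds on.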

From a practical point of view, this complies with the above-described situation where the major part of the state transitions takes place within the sets of the partition. Hence, we may expect the method to exhibit rapid convergence in a larger class of practical situations even without a coarse grid correction. Sadly enough, practical experience in [Pap10] reveals that even in cases where the second of the above conditions is not satisfied, the aggregation method with a Richardson step outperforms the IAD method with respect to computation time, although the number of iterations is larger by one to two orders of magnitude. This is due to the considerably greater computational cost of the algebraic Schwarz sweep, which requires the solution of a multitude of smaller systems of linear equations, whereas the Richardson step consists of only one matrix–vector multiplication.

