
SCALABLE COOPERATIVE MULTIAGENT

REINFORCEMENT LEARNING IN THE CONTEXT OF

AN ORGANIZATION

A Dissertation Presented

by

SHERIEF ABDALLAH

Submitted to the Graduate School of the
University of Massachusetts Amherst in partial fulfillment
of the requirements for the degree of

DOCTOR OF PHILOSOPHY

September 2006

Computer Science


UMI Number: 3242334

Copyright 2007 by ProQuest Information and Learning Company.
All rights reserved. This microform edition is protected against
unauthorized copying under Title 17, United States Code.

UMI Microform 3242334
ProQuest Information and Learning Company
300 North Zeeb Road, P.O. Box 1346
Ann Arbor, MI 48106-1346


All Rights Reserved


SCALABLE COOPERATIVE MULTIAGENT

REINFORCEMENT LEARNING IN THE CONTEXT OF

AN ORGANIZATION

A Dissertation Presented

by
SHERIEF ABDALLAH

Approved as to style and content by:

Victor Lesser, Chair

Abhi Deshmukh, Member

Sridhar Mahadevan, Member

Shlomo Zilberstein, Member

W. Bruce Croft, Department Chair
Computer Science


ABSTRACT

SCALABLE COOPERATIVE MULTIAGENT

REINFORCEMENT LEARNING IN THE CONTEXT OF

AN ORGANIZATION

SEPTEMBER 2006

SHERIEF ABDALLAH

B.Sc., CAIRO UNIVERSITY
M.Sc., CAIRO UNIVERSITY
M.Sc., UNIVERSITY OF MASSACHUSETTS
Ph.D., UNIVERSITY OF MASSACHUSETTS AMHERST

Directed by: Professor Victor Lesser

Reinforcement learning techniques have been successfully used to solve single agent optimization problems, but many of the real problems involve multiple agents, or multi-agent systems. This explains the growing interest in multi-agent reinforcement learning algorithms, or MARL. To be applicable in large real domains, MARL algorithms need to be both stable and scalable. A scalable MARL algorithm will be able to perform adequately as the number of agents increases. A MARL algorithm is stable if all agents (eventually) converge to a stable joint policy. Unfortunately, most of the previous approaches lack at least one of these two crucial properties.

This dissertation proposes a scalable and stable MARL framework using a network of mediator agents. The network connections restrict the space of valid policies, which reduces the search time and achieves scalability. Optimizing performance in such a system consists of optimizing two subproblems: optimizing mediators' local policies and optimizing the structure of the network interconnecting mediators and servers.

I present extensions to Markovian models that allow exponential savings in time and space. I also present the first integrated framework for MARL in a network, which includes both a MARL algorithm and a reorganization algorithm that work concurrently with one another. To evaluate performance, I use the distributed task allocation problem as a motivating domain.


TABLE OF CONTENTS

Page

ABSTRACT iv

LIST OF TABLES x

LIST OF FIGURES xi

CHAPTER 1 INTRODUCTION 1

1.1 The Distributed Task Allocation Problem, DTAP 6

1.2 Modeling and Solving Multi-agent Decisions 8

1.2.1 Decision in Single Agent Systems 8

1.2.2 Decision in Multi Agent Systems 10

1.2.3 Feedback Mechanisms for Computing Cost 14

1.3 Contributions 15

1.4 Summary 16

2 STUDYING THE EFFECT OF THE NETWORK STRUCTURE AND ABSTRACTION FUNCTION 18

2.1 Problem definition 18

2.1.1 Complexity 19

2.2 Proposed Solution 21

2.2.1 Architecture 23

2.2.2 Local Decision 25

2.2.3 State Abstraction 26

2.2.4 Task Decomposition 29

2.2.5 Learning 31

2.2.6 Neural Nets 34


2.2.7 Organization Structure 34

2.3 Experiments and Results 35

2.4 Related work 40

2.5 Conclusion 43

3 EXTENDING AND GENERALIZING MDP MODELS 44

3.1 Example 48

3.2 Semi Markov Decision Process, SMDP 49

3.3 Randomly available actions 51

3.3.1 The wait operator 56

3.4 Extension to Concurrent Action Model 56

3.5 Learning the Mediator’s Decision Process 58

3.5.1 Handling Multiple Tasks in Parallel 58

3.6 Results 60

3.6.1 The Taxi Domain 61

3.6.2 The DTAP Experiments 62

3.6.3 When Traditional SMDP Outperforms ℘-SMDP 68

3.7 Related Work 69

3.8 Conclusion 70

4 LEARNING DECOMPOSITIONS 72

4.1 Motivating Example 74

4.2 Multi-level policy gradient algorithm 74

4.2.1 Learning 77

4.3 Cycles 80

4.4 Experimental Results 81

4.5 Related Work 87

4.6 Conclusion 88

5 WEIGHTED POLICY LEARNER, WPL 89

5.1 Game Theory 91

5.1.1 Learning and Convergence 93

5.2 The Weighted Policy Learner (WPL) algorithm 94


5.2.1 WPL Convergence 95

5.2.2 Analyzing WPL Using Differential Equations 97

5.3 Related Work 99

5.3.1 Generalized Infinitesimal Gradient Ascent, GIGA 102

5.3.2 GIGA-WoLF 102

5.4 Results 103

5.4.1 Computing Expected Reward 103

5.4.1.1 Fixing Learning Parameters 104

5.4.2 Benchmark Games 104

5.4.3 The Task Allocation Game 108

5.5 Conclusion 109

6 MULTI-STEP WEIGHTED POLICY LEARNING AND REORGANIZATION 114

6.1 Performance Evaluation 115

6.2 Optimizing Local Decision 116

6.3 Updating the State 117

6.4 MS-WPL Learning Algorithm 118

6.5 Re-Organization Algorithm 121

6.6 Algorithm Parameters 123

6.7 Experimental Results 124

6.7.1 MS-WPL 125

6.7.2 Re-Organization 132

6.8 Related Work 134

6.9 Conclusion 140

7 RELATED WORK 141

7.1 Scheduling 141

7.2 Task Allocation 142

7.3 Partially Observable MDP 143

7.4 Markovian Models for Multi-agent Systems 146

8 CONCLUSION 148

8.1 Summary 148

8.2 Contributions 153


8.3 Limitations and Future Work 155

APPENDICES

A SYMBOLIC ANALYSIS OF WPL DIFFERENTIAL EQUATIONS 158

B SOLVING WPL DIFFERENTIAL EQUATIONS NUMERICALLY USING MATHEMATICA 163

BIBLIOGRAPHY 166


LIST OF TABLES

3.1 Types of Tasks 63

3.2 Average number of decisions per time step for different termination schemes and different task arrival rates p 67

3.3 Reward gained using τchange and τany termination schemes, normalized (divided) by the reward gained using τall 67

5.1 2-action games 92

5.2 3-action games 106

5.3 TAT/dTAT for different values of N (columns) and u (rows) 109

6.1 Parameters 124


LIST OF FIGURES

1.1 Task allocation using a network of agents 4

1.2 Action hierarchy of both mediator (a) and server (b) 8

2.1 An Organization Hierarchy 22

2.2 An example of how an organization solves CDTAP 24

2.3 A mediator architecture 25

2.4 The recursive decision process of a mediator 25

2.5 Different Organization Structures 36

2.6 Average utility for random, greedy and learned policies and for different organizations 38

2.7 Learning curve 38

2.8 Utility standard deviation for random, greedy and learned policies and for different organizations 39

2.9 Messages average for random, greedy and learned policies and for different organizations 40

2.10 Average percentage of wasted resources for random, greedy and learned policies and for different organizations 40

2.11 Relationship between Hierarchical Reinforcement Learning and my approach 43

3.1 A network of mediators for assigning agents to tasks 46

3.2 The experiment scenario 48


3.3 The relationship between policies learned using τall, τcontinue, τany, and τchange 58

3.4 The hierarchy of the joint action ā_{T0,T4} 60

3.5 The taxi domain 61

3.6 Performance of ℘-MDP in the taxi domain 63

3.7 The performance of different termination schemes when pT4 = 0.6 and the wait operator is disabled 65

3.8 The performance of different termination schemes when pT4 = 0.6 and the wait operator is enabled 66

3.9 The performance of different termination schemes when pT4 = 0.1 and the wait operator is enabled 66

3.10 The performance of the policy learned using τchange with available actions as part of the state (SMDP) and factored out of the state (℘-SMDP) 68

3.11 The performance of the policy learned using τall with available actions as part of the state (SMDP) and factored out of the state (℘-SMDP) 68

4.1 A network of agents that are responsible for assigning resources to incoming tasks 75

4.2 Agent decision with recursive decomposition 76

4.3 A large scale network of 100 resources and 20 agents 82

4.4 The effect of the dynamic learning rate 83

4.5 The effect of two level stochastic policies on performance 84

4.6 Policies of different agents 85

4.7 The effect of dynamic learning rate in the large system scenario 86

4.8 The effect of two level policies in the large system scenario 86

5.1 An example of distributed task allocation 90

5.2 An illustration of policy oscillation 94


5.3 An illustration of WPL convergence 98

5.4 An illustration of WPL convergence to the (0.9,0.9) NE in the p-q space: p on the horizontal axis and q on the vertical axis 100

5.5 An illustration of WPL convergence to the (0.9,0.9) NE (p(t) and q(t) on the vertical axis) against time (horizontal axis) 100

5.6 An illustration of WPL convergence for 10x10 NE(s) 101

5.7 Convergence of WPL in different two-player-two-action games. The horizontal axis represents time; the vertical axis represents the probability of choosing the first action, π(a1) 105

5.8 Convergence of the previous approaches in the tricky game. The horizontal axis represents time; the vertical axis represents the probability of choosing the first action, π(a1) 107

5.9 Convergence of GIGA-WoLF and WPL in the rock-paper-scissors game. The horizontal axis represents time; the vertical axis represents the probability of choosing each action 111

5.10 Convergence of GIGA-WoLF and WPL in Shapley's game. The horizontal axis represents time; the vertical axis represents the probability of choosing each action 112

5.11 Convergence of GIGA-WoLF and WPL in distributed task allocation. The horizontal axis represents time; the vertical axis represents the reward received by each individual agent, which equals -TAT 113

6.1 Task allocation using a network of agents 115

6.2 ATST in the 2x2 grid for greedy, Q-learning, and MS-WPL 126

6.3 ATST in 2x2 grid for different values of |H| 127

6.4 AUPD in 2x2 grid for different values of |H| 127

6.5 ATST in 10x10 grid for different values of |H| 128

6.6 AREQ in 10x10 grid for different values of |H| 128

6.7 AUPD in 10x10 grid for different values of |H| 129

6.8 AREQ in 2x2 grid for different values of L 129


6.9 ATST in 2x2 grid for different values of L 130

6.10 ATST in 6x6 grid for different values of L 130

6.11 AUPD in 2x2 grid for different values of L 131

6.12 AUPD in 6x6 grid for different values of L 131

6.13 AREQ in 10x10 grid for different values of PO, boundary load 133

6.14 ATST in 10x10 grid for different values of PO, boundary load 133

6.15 AREQ in 10x10 grid, center load 134

6.16 ATST in 10x10 grid, center load 134

6.17 AUPD in 10x10 grid for different values of PO, boundary load 135

6.18 AUPD in 10x10 grid 135

6.19 Reorganization when load on boundary, at time 10,000 (left) and 290,000 (right) 136

6.20 Reorganization when load at center, at time 10,000 (left) and 290,000 (right) 136

A.1 An illustration of WPL convergence 158

A.2 Symbolic solution, using Mathematica, of the first set of differential equations 162


CHAPTER 1 INTRODUCTION

Many problems that an agent faces can be formulated as decision making problems, where an agent needs to decide which action to execute in order to maximize the agent's objective function. The solution to the decision making problem is a policy that specifies which action to execute in each state. When the system consists of multiple agents, the solution is then a set of policies, or a joint policy, specifying what each agent should do in every state.

I present in this dissertation frameworks and algorithms, based on multi-agent reinforcement learning, to solve the decision making problem approximately for large scale multi-agent systems. The central assumption underlying my contributions is that agents can be organized in an overlay network, where each agent optimizes its own local decision by interacting only with neighboring agents (I will justify this assumption shortly). For simplicity, I have also assumed agents are cooperative. This assumption permits focusing on scalability without worrying about competitive aspects, such as malicious behavior, trust, and dividing profit. However, the techniques developed in the dissertation, as I will discuss later, can be extended to handle some of the issues that arise in competitive domains.

Optimizing performance in a network of agents involves optimizing agent decisions and optimizing the network (organization) itself. The remainder of this section discusses how to optimize the decision making problem. The section also justifies the use of an underlying organization in order to limit agent decisions, showing the interaction between optimizing the underlying organization and optimizing agent decisions.


Solving a decision making problem consists of two components: a framework for modeling the decision process itself and algorithms for solving that model (i.e., finding a policy that maximizes performance). I have chosen the Markov Decision Process (MDP) model and its variants [78, 79, 77] because of their formality, simplicity, and generality. The main idea of an MDP is to associate a reward with each action in each state of the world. The objective function is defined as the total reward an agent will get for following a policy. An optimal policy is the policy that maximizes the objective function. The MDP framework also defines the transition of the world from one state to another using a fixed probability distribution. Section 1.2 provides more details regarding the MDP model and its variants. One of my contributions is identifying certain limitations of existing MDP models and proposing a generalization to the MDP model that leads to better performance for a specific class of problems (Chapter 3 provides more details).

After choosing a modeling framework, one needs to choose a methodology for finding the optimal policy (given the underlying model). For MDP models there are two main directions for finding the optimal policy: the planning (or offline) approach and the reinforcement learning (RL, or online) approach. The planning approach assumes knowing the dynamics of the world a priori.1 Therefore, a planning agent can find the optimal policy before interacting with the environment. This is a strong assumption in real domains due to the uncertainty of the world behavior. In contrast, a reinforcement learning agent (or a learning agent for short) interacts with the world using an arbitrary initial policy, without knowing the world dynamics a priori. The learning agent then uses this interaction to refine its policy gradually in order to improve its performance. I have chosen reinforcement learning as the basis of my solution because of its applicability to real domains.

1 E.g., if the agent is in state s and executes action a, the agent's state will transition to state s′ with probability P(s, a, s′). More details are in Section 1.2.


Reinforcement learning techniques have been successfully used to solve single agent decision problems [78]. Applying single agent RL techniques in multi-agent systems is possible and may work in some domains, but in general there is a need for reinforcement learning techniques that take into account the presence of other agents. This explains the recent growing interest in multi-agent reinforcement learning algorithms, or MARL [26, 73, 17, 16]. The goal of my work is to achieve a scalable (in terms of the number of agents) and stable (in terms of joint convergence, as I will describe shortly) MARL framework.

Developing such a MARL framework is difficult because of two challenges: convergence and scalability. A MARL algorithm converges if all agents, which are executing the MARL algorithm,2 will eventually stabilize to a joint policy.3 Analyzing convergence of MARL algorithms has recently been a topic of interest [17, 85, 7]. One of my contributions is a new MARL algorithm that outperforms the state of the art algorithms. I theoretically analyze the algorithm's convergence properties and provide an informal proof of its convergence in a subclass of problems (Chapter 5). Convergence is more difficult when each agent does not have a global view of the system. In such a case, each agent is said to have partial observability of the system, which is common in real domains. For example, suppose a group of agents can execute tasks and are interconnected through an overlay network as illustrated by Figure 1.1 (this is an instance of the distributed task allocation problem that is presented in Section 1.1). Agent A0 receives a task T1, and thinking neighbor A3 is underloaded, it sends T1 to A3. A0 receives a reply message saying A3 is overloaded. A0 may now switch its policy to send future requests to A1. However, by the time the next request comes, A3 may be underloaded and A1 overloaded. Therefore, A0 may switch its policy indefinitely without converging if the learning algorithm is not designed with care. I extend my MARL algorithm in order to take partial observability into account (Chapter 6).

2 Agents are concurrently learning.

3 This stable joint policy is usually a Nash Equilibrium [17], as described in Chapter 5.

A MARL framework is scalable if its performance degrades gracefully as the number of learning agents grows. A "flat" multi-agent system, where each agent interacts with all other agents and observes their states and actions, will suffer an exponential growth in state space and learning time. One of the fundamental distinguishing characteristics of my work is that I limit the interaction between agents by imposing an overlay network, or an organization, as shown in Figure 1.1. An agent in such an organization interacts only with its immediate neighbors. Using an organization, along with abstraction as I describe in Section 1.2, increases scalability by limiting the explosion in state space. For simplicity, I treat the organization as an overlay network, without worrying about restrictions imposed by the underlying domain. It should be noted that in some cases it may be better for the organization to respect restrictions imposed by the underlying domain. For example, in packet routing [18, 61] it may be better for the organization to reflect the underlying communication network and the physical location of agents. Also, in the Grid domain [5, 31] the organization may reflect the underlying administrative domains (e.g., different universities).

Figure 1.1 Task allocation using a network of agents


Although using an organization to restrict interaction between agents is essential to solve the scalability problem, it introduces an additional problem: optimizing the organization itself. This problem is interdependent with optimizing the local decision of each agent. The organization defines the context for each agent, therefore constraining its local decision. The context of an agent a is the available information and the available actions from a's perspective. This restricts the set of joint policies that agents can learn in a given organization. One of my contributions is the development of the first algorithm that uses information from reinforcement learning to restructure the organization in order to maximize performance (Chapter 6). The main contribution of this thesis is an integrated and distributed framework for optimizing both the organization and agent decisions in a large network of agents.

To summarize, the goal of this dissertation is to optimize the performance of a network of agents using reinforcement learning. I have pursued three complementary directions for improving the performance of such a system: developing a better model for the local decision process of each agent (Chapter 3), developing better reinforcement learning algorithms for finding optimal policies (Chapter 4 and Chapter 5), and developing an algorithm for reorganizing the agents' network (Chapter 6). Before describing my contributions in further detail, and to make the discussion more concrete, the next section describes the distributed task allocation problem (DTAP) that I will use throughout the thesis as a motivating domain and for illustration. Then Section 1.2 reviews MDP models and the algorithms that solve them in further detail, relating both the models and the algorithms to my contributions. The section also describes a comprehensive MDP model of an agent decision problem when operating in a network. Section 1.3 summarizes my contributions. Finally, Section 1.4 provides a guide to the dissertation.


1.1 The Distributed Task Allocation Problem, DTAP

The distributed task allocation problem (DTAP) is to match incoming tasks to distributed servers. Many application problems with varying complexities can be mapped to this abstract problem. One example is the Grid domain, which consists of a set of distributed servers connected through a high speed network [34]. Tasks in this domain are applications that appear at random locations and times requesting resources. Another example is the Collaborative Adaptive Sensing of the Atmosphere (CASA) domain [88]. In this domain, servers are a set of geographically distributed radars. Tasks are meteorological phenomena that appear stochastically in space and time. Radars need to be allocated to sense different phenomena.

For illustration, consider the example scenario depicted in Figure 1.1. Agent A0 receives task T1, which can be executed by any of the named agents A0, A1, A2, A3, and A4. All agents other than A4 are overloaded. This information is not known to A0 because it does not interact directly with A4. From a global perspective, the best action for A0 is to route the request through A2 to A4. The challenge here is that A0 needs to realize this best action without knowing that A4 even exists. The work I present in this thesis will allow A0 to learn that. Furthermore, this is done in an integrated framework that concurrently optimizes the network connections so that A0 becomes directly connected to A4 if this leads to better performance, and thus local agent policies and the organization simultaneously evolve.

Most of the previous approaches to DTAP [29, 50] relied on pure heuristics or on exhaustive search, assuming everything is known a priori [69]. While a few attempted to use formal Markovian models to model DTAP [39, 32], their applicability was limited because existing MDP models are inefficient in representing this problem. Some work attempts to solve this problem using a centralized solver/agent [11], which is not scalable in large scale applications that are distributed by nature, such as the Grid or CASA. Recently, this problem has received attention in the Systems research area, motivated by the Grid domain [58]. However, the proposed solutions are either centralized or purely heuristic (Chapter 7 gives a broader overview of related work; furthermore, each chapter reviews related state of the art in more detail). The work presented here proposes a scalable solution using an underlying organization and multi-agent reinforcement learning.

Figure 1.2 illustrates the hierarchy of actions in a DTAP agent (it should be noted that this action hierarchy is conceptual, i.e., agents may make multiple decisions at different levels at the same time). I distinguish between two roles: a mediator and a server. A server agent is responsible for handling an actual resource, and therefore it typically lies on the network edge. A mediator agent is responsible for handling tasks and routing them across the network (i.e., internal network nodes). This work focuses on the mediator action hierarchy (Figure 1.2a). The first action level decides which tasks to accept, reject, or defer for future consideration. The second level decides, for a given task, which decomposition to choose if there are multiple ways to decompose or partition a task. The third and last level determines which neighbor to route a task to. Both servers and mediators share the highest level actions: accept or reject an incoming task. The work in Chapter 3 focuses on level-0 actions and illustrates how rejecting a task can increase the total payoff by allowing future tasks to be accepted [62].
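As an illustration only, the sketch below shows one way these three conceptual decision levels could be laid out in code; the class, the method names, and the placeholder policies are my own assumptions, not the dissertation's implementation.

```python
import random
from enum import Enum

class AdmitAction(Enum):            # level-0 actions shared by servers and mediators
    ACCEPT = 1
    REJECT = 2
    DEFER = 3

class MediatorSketch:
    """Hypothetical mediator exposing the three conceptual decision levels."""

    def __init__(self, neighbors, queue_limit=10):
        self.neighbors = list(neighbors)   # immediate neighbors in the overlay network
        self.queue = []
        self.queue_limit = queue_limit

    def admit(self, task) -> AdmitAction:
        # Level 0: accept, reject, or defer an incoming task (placeholder threshold rule).
        return AdmitAction.ACCEPT if len(self.queue) < self.queue_limit else AdmitAction.REJECT

    def decompose(self, task):
        # Level 1: choose one of possibly many decompositions; here the trivial one.
        return [task]

    def route(self, subtask):
        # Level 2: choose which neighbor receives the subtask (random placeholder policy).
        return random.choice(self.neighbors)
```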

Learning a policy to optimize these decisions is challenging because of partial observability, convergence, and exogenous events [15]. I have discussed the difficulty associated with partial observability and convergence previously. Exogenous events are events that change the system state and are not controllable by an agent's actions. Task arrivals in DTAP are exogenous events. Either implicitly or explicitly, agents need to take into account the arrival of future tasks in order to make an optimal decision at the current time step. For example, if at the current time step a task of low value arrives, then depending on the probability of a more valuable task arriving before the low value task finishes, an agent may accept or reject the low value task. Thus, this work will exploit the specific stochastic patterns of task arrival and their entry points in the system to generate better task allocation policies.

Figure 1.2 Action hierarchy of both mediator (a) and server (b) (mediator levels: accept/reject tasks, decompose tasks, schedule tasks, route tasks; server: accept/reject tasks, domain specific actions)

1.2 Modeling and Solving Multi-agent Decisions

The first step in finding the optimal policy, for a given decision process, is to model the decision process itself. Markovian models are the most widely used because of their simplicity and generality. These models respect the Markovian assumption, which means that to choose an optimal action for execution the agent only needs to know the current state. In other words, the agent cannot achieve better performance by remembering history. This section first reviews Markovian models and learning algorithms for single agent systems as an introduction to modeling decision processes in multi-agent systems, the real focus of this dissertation, which is described next. Relationships to my contributions are established when possible.

1.2.1 Decision in Single Agent Systems

An agent in a single agent system interacts with a stationary environment. A Markov decision process, or MDP [78], has been used extensively for reasoning about a single agent's decision because of the model's simplicity (four simple components) and generality (almost any decision process can be expressed as an MDP). An MDP is defined by the tuple ⟨S, A, P, R⟩: S is the set of states, A(s) is the set of actions available at a given state s, P(s, a, s′) is the probability of reaching state s′ after executing action a at state s, and R(s, a, s′) is the average reward if the agent executes action a at state s and reaches state s′.

Several reinforcement learning algorithms have been developed for solving an MDP and finding the optimal policy, including Q-learning and Sarsa(λ) [78]. Unfortunately, the MDP model makes the strong assumption that an agent observes the state completely. In practice, an agent usually sees the state only partially. For example, consider a driver on a highway. At a given time the driver may observe a speed limit sign. This sign may become unobservable a few seconds later. However, the driver still needs to register this observation (and the history of observations in general) to make a decision about how to set her speed. Partial observability is also common in large-scale multi-agent systems, where it is prohibitively expensive to make every agent aware of the state of every other agent in the system.
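For concreteness, here is a minimal tabular Q-learning sketch of the kind of single agent RL algorithm referred to above; the environment interface (reset/actions/step) and the parameter values are my own placeholders, not taken from the dissertation.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q = defaultdict(float)                          # Q[(state, action)] -> value estimate
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            actions = env.actions(s)                # A(s): actions available in state s
            if random.random() < epsilon:           # epsilon-greedy exploration
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)           # assumed environment interface
            best_next = 0.0 if done else max(Q[(s_next, x)] for x in env.actions(s_next))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```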

Using an ordinary MDP in such situations, assuming what the agent observes is the actual state, violates the Markovian assumption. A partially observable MDP model, or POMDP, is an extension of the MDP model that takes partial observability into account by explicitly introducing the notion of observations. A POMDP is defined by the tuple ⟨S, A, P, R, Ω, O⟩. The model introduces two more components than an ordinary MDP: Ω and O. Ω = {ω1, ..., ω|Ω|} is the set of observations. An observation is any input to the agent, such as sensory input or received communication messages. The function O(ω|s, a, s′) is the probability of observing ω if the agent executes action a in state s and transitions to state s′.

Although the POMDP model accurately captures partial observability, it is considerably more expensive to learn or compute the optimal policy using the POMDP model, and usually approximation algorithms are used instead [51, 52, 43, 22, 44, 54, 9, 57, 49, 27, 68] (more details are in Chapter 7). It should be noted that using simple MDP learning algorithms such as Q-learning may lead to arbitrarily bad policies [6] (primarily because most MDP learning algorithms find only deterministic policies).

In my work I efficiently find an approximate solution to Markovian models, despite partial observability, by using gradient ascent [13] to learn stochastic policies. A stochastic policy defines a probability distribution over actions, instead of choosing a single action as a deterministic policy would do. I also use limited history [51, 54] and eligibility tracing [54, 6] techniques to approximate the history of observations.
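To make the notion of a stochastic policy concrete, the sketch below maintains a softmax distribution over actions and nudges it with a generic gradient-style update; this is an illustrative stand-in of my own, not the specific gradient ascent algorithms developed later in the dissertation.

```python
import math
import random

class SoftmaxPolicy:
    """A stochastic policy: a probability distribution over actions in each state."""

    def __init__(self, actions, learning_rate=0.01):
        self.actions = list(actions)
        self.learning_rate = learning_rate
        self.preferences = {}                      # preferences[(state, action)] -> score

    def probabilities(self, state):
        prefs = [self.preferences.get((state, a), 0.0) for a in self.actions]
        exps = [math.exp(p) for p in prefs]
        total = sum(exps)
        return [e / total for e in exps]

    def sample(self, state):
        return random.choices(self.actions, weights=self.probabilities(state))[0]

    def update(self, state, action, advantage):
        # Gradient of log pi(action|state) w.r.t. softmax preferences:
        # +(1 - p(action)) for the taken action, -p(a) for the others.
        probs = dict(zip(self.actions, self.probabilities(state)))
        for a in self.actions:
            grad = (1.0 - probs[a]) if a == action else -probs[a]
            key = (state, a)
            self.preferences[key] = self.preferences.get(key, 0.0) + self.learning_rate * advantage * grad
```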

In Chapter 3, I develop extensions to single agent Markovian models, which achieve exponential savings in the time and space required for learning an optimal policy (in some classes of problems). Single agent reinforcement learning algorithms, however, are not always applicable in a multi-agent system. The next section illustrates this point along with my contributions to multi-agent learning.

1.2.2 Decision in Multi Agent Systems

Chapter 7 reviews different Markovian models for multi-agent systems that require each agent to know about every other agent in the system. I call these models the joint Markovian models, such as the multi-agent MDP (MMDP) [14] and the decentralized MDP (DEC-MDP) [12]. Because the main purpose of my work is to achieve scalable multi-agent learning, I opted to model agent decision processes in a multi-agent system approximately as a collection of POMDPs (one POMDP for each agent).4 The presence of other agents, however, makes the environment non-stationary from a single agent perspective. A POMDP is non-stationary if the dynamics of the environment change over time, i.e., if either or both of the two probability functions O and P change over time. If one were to fix the policies of all agents except for one learning agent, the POMDP of this learning agent would become stationary. As expected, non-stationarity makes learning in multi-agent systems more challenging than in single agent systems, which has caused the introduction of multi-agent reinforcement learning algorithms.

Multi-agent reinforcement learning (MARL) algorithms [23, 60, 73, 61, 17, 89, 42, 24, 16] take the presence of other agents into account when analyzing performance. Performance is measured in terms of the collected reward and the algorithm's convergence. A MARL algorithm converges if all learning agents eventually stabilize to a joint policy.5 Unfortunately, most MARL algorithms either guarantee convergence [42, 24] or maximize the collected reward [23, 60, 73, 61, 17, 89], but not both. Furthermore, most of the research in MARL uses game theory as the main framework for analyzing and evaluating MARL algorithms, focusing on single stage games. The algorithms that do support multiple system states assume each agent can observe the global state as well as other agents' actions (full observability). Therefore, these algorithms are solving a set of non-stationary MDPs, not a set of non-stationary POMDPs.

4 Note that in this section I use a collection of POMDPs for generality. If the environment is observable by every agent, then similar arguments hold for a collection of MDPs.

This thesis addresses these limitations as follows. I present in Chapter 5 WPL, a new MARL algorithm that both maximizes collected reward and converges in a simple class of problems. I then extend this algorithm into MS-WPL, based on ideas from stationary POMDP learning algorithms, to be able to handle non-stationary POMDPs. And finally, I integrate this into a scalable framework that adapts the underlying organization of agents in order to improve performance. It should be noted that, similar to other MARL algorithms, my algorithms may get trapped in a local maximum,6 because each agent is optimizing its own decision without having the global view of the system. However, I feel this is a reasonable price to pay in order to achieve scalability.

As mentioned before, I achieve scalability by limiting agent interactions through an underlying organization, or an overlay network. Consequently, I have extended the definition of a POMDP to accommodate this assumption by defining an organizational POMDP, or O-POMDP. In the remainder of this section I will illustrate the O-POMDP model by applying it to the generic DTAP problem. In the following chapters, depending on which components of the O-POMDP are relevant to evaluating each of my contributions, I will simplify the model in order to focus attention on the relevant components.

An O-POMDP for agent i is defined by a tuple ⟨Si, Ai, Pi, Ri, Ωi, Oi, Ni, λi⟩. The first six components are similar to the ordinary POMDP definition. The last two components are introduced to exploit the organization structure: the set of neighboring agents Ni ⊆ M and the abstraction function λi. The idea is that each agent optimizes its policy using (only) the state of its neighboring agents. I introduce the notion of state abstraction in order to relax the restrictions imposed by this assumption. An abstraction function aggregates the state of neighbors into a less detailed state. This process occurs recursively across the network, allowing an agent A to know about the state of a distant agent C, but with less detail than a neighboring agent B. It should be noted that the purpose of the abstraction is to limit the state space of each agent, therefore making learning more tractable. This assumption appears in many distributed domains to reduce their complexity. In the network routing domain, a router computes the delay to its immediate neighbors. In human organizations, each department has its own state with little knowledge of other departments. Naturally, whether this assumption holds or not depends on the network structure and how the state is abstracted. I propose REORG in Chapter 6, an algorithm that starts with an arbitrary network structure/organization and then adapts this organization.

Q^i = {T^i_1, ..., T^i_{|Q^i|}} is the set of available tasks in agent i's queue. Each task T^i_j is defined by a set of features that include its type y^i_j, its payoff r^i_j, and its originator o^i_j (i.e., the neighbor who sent this task to agent i). My work in [3] (Chapter 3) focuses on Q^i and shows how to extend existing models to achieve exponential savings (in both time and space) and better policies.

E^i_{j,k} is the state of subtask k of the currently executing task j at agent i. The state of a subtask is a measurement of the progress of that subtask, and can range from a simple value in {FAILED, PENDING, SUCCEEDED} to the total running time of the subtask.

L^i is any additional local features of the agent that are not covered by Q^i and E^i. For example, in a robot, L^i may include battery status. In this dissertation I assume L^i is null in a mediator, since mediators are computational (and not physical) roles, but this is not necessary. The following section discusses different feedback mechanisms for computing cost (reward).
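Purely for illustration, the components just listed might be packaged into a local-state record as follows; the field names and types are my own assumptions based on the description above, not the dissertation's notation made executable.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class Task:
    task_type: str        # y^i_j: the task's type
    payoff: float         # r^i_j: utility gained if the task is completed
    originator: str       # o^i_j: the neighbor that sent the task to agent i

@dataclass
class LocalState:
    """Hypothetical local view of agent i in an O-POMDP."""
    queue: List[Task] = field(default_factory=list)                           # Q^i
    subtask_status: Dict[Tuple[int, int], str] = field(default_factory=dict)  # E^i_{j,k}, e.g. "PENDING"
    local_features: Optional[dict] = None                                     # L^i (None for a pure mediator)
    neighbor_abstracts: Dict[str, dict] = field(default_factory=dict)         # lambda-abstracted neighbor states
```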

1.2.3 Feedback Mechanisms for Computing Cost

In order to learn, an agent needs feedback about the outcome of a selected action. If the action is associated with a neighbor, then different feedback mechanisms are possible. At one extreme, the associated neighbor could return its own collected estimates of the potential outcomes (reward and state) of the action. At the other extreme, the neighbor could recursively ask its neighbors for their outcome estimates until the request reaches leaf nodes in the network (in this case the feedback is no longer an estimate, but the real outcome). Q-routing [18], which learns the routing tables in a network of routers, uses a depth-1 feedback; in other words, a neighbor immediately replies with its own estimate of a route length. The feedback depth can go from 0, using a local estimate, to ∞, using an actual outcome.

One can view the first extreme as a form of temporal difference (TD) learning [78] and the other extreme as a form of Monte-Carlo learning [78]. The first feedback extreme is the fastest and least accurate, while the other extreme is the slowest but the most accurate. My work uses a fixed feedback depth; however, I use different values in different aspects of my research.7
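The following sketch illustrates the feedback-depth idea under simplifying assumptions of my own (each node stores a learned cost estimate and a real local cost): depth 0 returns the local estimate, while larger depths recurse toward the leaves and approach the actual outcome.

```python
class Node:
    """Hypothetical network node used only to illustrate feedback depth."""

    def __init__(self, name, step_cost=1.0, estimate=5.0):
        self.name = name
        self.step_cost = step_cost     # real cost incurred at this node (e.g. local delay)
        self.estimate = estimate       # learned estimate of the remaining cost from here
        self.neighbors = []            # downstream neighbors

    def feedback(self, depth):
        if depth == 0:
            return self.estimate       # depth 0: local estimate only (TD-like, fast, least accurate)
        if not self.neighbors:
            return self.step_cost      # a leaf: the actual outcome
        # depth >= 1: incur the real local cost and recursively ask neighbors; with
        # unbounded depth this bottoms out at the leaves (Monte-Carlo-like, accurate).
        return self.step_cost + min(n.feedback(depth - 1) for n in self.neighbors)
```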

It should be noted that most of the previous work on coordination using Markovian models has tied feedback to requests (i.e., when an agent requests a task from a neighbor, it receives feedback with the task's expected outcomes) [18]. In some situations it might be better to send feedback to a neighbor even if this neighbor has not sent a request. This is particularly important when the published cost of a resource drops. It may be the case that other neighbors are unlikely to request tasks from a costly agent, and therefore will not get the feedback that the cost has recently dropped. At the other extreme, sending feedback to all neighbors every time step leads to unnecessary overhead. In Chapter 6 I propose a heuristic communication mechanism that strikes a balance between the previous two extremes. Experiments verify that a few communicated messages can result in a large gain in payoff.

7 One interesting future direction is to adaptively change the feedback depth to balance the need for more responsive learning and more accurate policies. One can view this as a form of limited depth A∗ search [65].

1.3 Contributions

The main objective of this work is to achieve scalable multi-agent reinforcement learning. The following contributions will be made to that end.

1. I experimentally show that different organizations affect the overall performance of the same population of learning agents. I also show that the performance of a network of agents does improve using reinforcement learning (when compared to greedy and random approaches), with random structures performing worst.

2. I develop two improvements to existing MDP models: ℘-MDP and τchange. ℘-MDP is a generalization of the MDP model [78] that can achieve an exponential reduction in the state space. τchange is an extension to the CAM [63] model that allows better policies to be found. The benefits of both improvements are established through both theoretical analysis and experimental results in more than one domain.

3. I develop a new multi-agent reinforcement learning algorithm, WPL, that converges in a subclass of problems. This is established through both theoretical analysis (which provides an informal proof of WPL's convergence in a subclass of problems) and experimental results (which show that WPL outperforms existing multi-agent reinforcement learning algorithms). WPL is extended to MS-WPL, which makes use of limited communication in order to improve performance. I experimentally show the benefits and the stability of MS-WPL in networks of up to 100 agents.

4. I develop a novel algorithm for restructuring the underlying agent network, REORG. I experimentally show the effectiveness of the algorithm in improving performance.

5. I show that both MS-WPL and REORG can work together in a concurrent and complementary way (REORG converges to an organization while MS-WPL converges to a policy). This is established by experimental results that demonstrate the stability of the joint execution of both algorithms.

1.4 Summary

This chapter introduces the main goal of my thesis: scalable and stable multi-agent reinforcement learning. This goal is motivated by the distributed task allocation problem, which serves as a concrete domain for applying the techniques and contributions I have developed in this thesis. The chapter then discusses how to model the decision problem of an agent, in a system of concurrently learning agents, as a non-stationary POMDP. I finally give an overview of the contributions I have made.

In Chapter 2 I investigate a heuristic and approximate MARL approach using Q-learning and neural nets, where I draw conclusions about the effect of the underlying network on system performance. In Chapter 3 I formalize the mediator decision problem, presenting extensions to existing Markovian models that make the model more compact and lead to faster learning. Chapter 4 investigates policy hill climbing approaches as an alternative to Q-learning and shows their potential, leading to the Weighted Policy Learner (WPL) algorithm, a novel MARL algorithm that I discuss and analyze in Chapter 5. Chapter 6 concludes my contributions with the first integrated framework for optimizing the performance of a network of adaptive agents using reinforcement learning, including Multi-Step WPL (MS-WPL), an extended version of WPL, and REORG, an algorithm for adapting the agents' organization. Chapter 7 discusses the related work and the state of the art. Finally, I summarize my thesis contributions and discuss future directions in Chapter 8.


CHAPTER 2 STUDYING THE EFFECT OF THE NETWORK STRUCTURE AND ABSTRACTION FUNCTION

This chapter proposes a new approach for solving the DTAP approximately using the MDP model and Q-learning [78]. The main contributions presented in this chapter are a distributed algorithm for approximately solving the DTAP problem and an experimental study that evaluates the effect of the underlying organization on performance. This has been my first, and my least theoretical, step in developing a framework for scalable multi-agent reinforcement learning. The chapter is organized as follows. Section 2.1 defines a continuous version of the DTAP problem that I will use throughout the chapter. Section 2.2 presents my proposed approach. Section 2.3 discusses the experimental results. Finally, Section 2.4 compares my approach to related work.

2.1 Problem definition

I slightly modify the DTAP definition described before to define a continuous DTAP, or CDTAP, which I will use throughout this chapter. Instead of each task requesting a set of (discrete) server types, a task requests certain amounts of each server type. Some simplifying assumptions are made to avoid adding a scheduling problem to CDTAP.1 Time is divided into episodes. At the beginning of each episode the system receives a sequence of tasks. Once a task is allocated a coalition, members of that coalition can not be assigned to another task until the end of the episode. At the end of every episode all servers are freed and ready to be allocated to the next sequence of tasks.

More formally, let T = ⟨T_1, T_2, ..., T_q⟩ be the sequence of tasks arriving in an episode. Each task T_i is defined by the tuple ⟨u_i, rr_{i,1}, rr_{i,2}, ..., rr_{i,m}⟩, where u_i is the utility gained if task T_i is accomplished, and rr_{i,k} is the amount of resource k required by task T_i. Let I = {I_1, I_2, ..., I_n} be the set of server agents in the system. Each agent I_i is defined by the tuple ⟨cr_{i,1}, cr_{i,2}, ..., cr_{i,m}⟩, where cr_{i,k} is the amount of resource k controlled by server I_i. The CDTAP problem is finding a subset of tasks S ⊆ T, together with a coalition of servers for each task in S, that maximizes the total gained utility while satisfying the coalition (resource) constraints.

1 One of my future directions is to integrate scheduling in my framework.
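The formal statement of the objective did not survive extraction, so the following is a hedged reconstruction of my own, consistent with the definitions above (here S ⊆ T is the set of accepted tasks and C_i ⊆ I is the coalition assigned to task T_i; it is an assumption, not necessarily the dissertation's exact formulation):

```latex
\max_{S \subseteq T,\ \{C_i\}_{T_i \in S}} \; \sum_{T_i \in S} u_i
\qquad \text{subject to} \qquad
\sum_{I_l \in C_i} cr_{l,k} \;\ge\; rr_{i,k} \quad \forall\, T_i \in S,\ \forall\, k,
\qquad
C_i \cap C_j = \emptyset \quad \forall\, T_i \neq T_j \in S.
```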

2.1.1 Complexity

The Multi-dimensional Knapsack Problem, MDKP, is known to be NP-hard. The input of this problem consists of a set of constraints C = {c_1, c_2, ..., c_m} and a set of objects O = {o_1, o_2, ..., o_q}, where each object is defined by the tuple o_i = ⟨u_i, w_{i,1}, w_{i,2}, ..., w_{i,m}⟩, where u_i is its value and w_{i,j} is its weight in dimension j. The goal is to find a subset of objects S ⊂ O such that Σ_{o_i ∈ S} u_i is maximized, while ∀c_j ∈ C, Σ_{o_i ∈ S} w_{i,j} ≤ c_j. This section proves that CDTAP is NP-hard by reducing the MDKP problem to the CDTAP problem as follows.

Theorem 1. CDTAP is NP-hard.

Proof. The decision version of the MDKP problem is:

Definition 2. Q1: given a set of objects O and a set of constraints C, is there a valid subset of objects S_k that satisfies the constraints and has a total utility of k or more?

The mapping from MDKP to CDTAP is as follows. For each object o_i = ⟨u_i, w_{i,1}, ..., w_{i,m}⟩ in MDKP, define a server a_i = ⟨w_{i,1}, ..., w_{i,m}⟩ and a task T_i = ⟨u_i, w_{i,1}, ..., w_{i,m}⟩. A task T′ = ⟨U, W_1, ..., W_m⟩ is also added, where U = Σ_{o_i ∈ O} u_i and W_j = (Σ_{o_i ∈ O} w_{i,j}) − c_j (this amount can be viewed as the gap between the demand of a resource and its supply). As will be described shortly, T′ encodes the constraints of the MDKP instance such that the coalition assigned to this task corresponds to the set of objects left outside the knapsack. The CDTAP decision problem then becomes:

Definition 3. Q2: given the set of tasks T and the set of servers A, is there a valid set of coalitions C̄ that results in U + k utility or more?

To prove the theorem, one needs to show that the answer to Q1 is yes iff the answer to Q2 is yes. Let C̄_k = {{a_i} : o_i ∈ S_k} be the set of coalitions corresponding to S_k (i.e., C̄_k is a set of singular coalitions). Let C_{−k} = {a_i : o_i ∉ S_k}, i.e., the coalition corresponding to all objects not in S_k. By definition, every coalition {a_i} ∈ C̄_k can be assigned to T_i, resulting in k utility. The hard part is to prove that the constraints of the MDKP problem are not violated by this assignment. This is where T′ comes into play. If S_k satisfies the MDKP constraints, then

∀j, Σ_{i : o_i ∈ S_k} w_{i,j} ≤ c_j,

∴ ∀j, Σ_{i : o_i ∈ O \ S_k} w_{i,j} = Σ_{i : o_i ∈ O} w_{i,j} − Σ_{i : o_i ∈ S_k} w_{i,j} ≥ (Σ_{i : o_i ∈ O} w_{i,j}) − c_j = W_j.

Because CDTAP is NP-hard, an optimal algorithm will need exponential time in the worst case (unless NP = P). An approximation algorithm, which can exploit information about the problem, is needed. If the environment (in terms of incoming task classes and patterns) does not follow any statistical model, and agents continually and rapidly enter and exit the system, there is little information to be exploited. Luckily, in many real applications the environment does follow a model, and the system can be assumed closed.

In such cases, it is intuitive to take advantage of this stability and organize the agents in order to guide the search for future coalitions. Here I focus on hierarchical organizations (the next chapters study more generic networks). Hierarchical organizations are distributed and scalable, and their state is easier to abstract, as will be described shortly. Figure 2.1 shows a sample hierarchical organization. A server (a leaf in Figure 2.1) represents the resources controlled by a single agent. A mediator (shown as a circle in Figure 2.1) is a computational role, which can be executed on any server agent or on dedicated computing systems. A mediator acts on behalf of both the servers and the mediators beneath it when it comes to interaction with higher mediators in the hierarchy.

Each mediator has a local policy that determines the order by which to decompose a high-level task into subtasks and allocate these subtasks to the mediator's children. The combination of these local policies constitutes a global hierarchical policy of the whole system. This section describes the architecture of a mediator agent and the underlying algorithms so that such a global policy is eventually reached through local optimizations. First I define some organizational terms that will be used later. The following section describes the mediator architecture. Then Sections 2.2.2-2.2.7 present the underlying algorithms.

Figure 2.1 An Organization Hierarchy

Each mediator M has a set of children, children(M), which is the set of nodes directly linked below it. So, for instance, in the organization shown in Figure 2.1, children(M6) = {I12, I13}, while children(M3) = {M4, M5, M6}. Conversely, each child C has a parent, mediators(C). For example, mediators(M4) = M3. For completeness, the children of a server are the empty set, and so is the parent of a root node. Each agent A (either a mediator or a server) controls, either directly or indirectly, a set of servers, cluster(A) (i.e., the leaves reachable from agent A). In the example above, cluster(M6) = {I12, I13}, cluster(M3) = {I7, I8, I9, I10, I11, I12, I13}, and cluster(I6) = {I6}. Also, for each agent A, I define members(A) to be the set of all agents reachable from A. In the above example, members(M3) = {M3, M4, M5, M6, I7, I8, I9, I10, I11, I12, I13}.
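A small sketch of these organizational accessors, using the Figure 2.1 values quoted above; the dictionary encoding of the hierarchy (and the split of I7-I11 between M4 and M5) is my own assumption for illustration.

```python
# Hypothetical encoding of part of the Figure 2.1 hierarchy: mediator -> children.
# (The exact split of I7-I11 between M4 and M5 is assumed for illustration.)
CHILDREN = {
    "M3": ["M4", "M5", "M6"],
    "M4": ["I7", "I8"],
    "M5": ["I9", "I10", "I11"],
    "M6": ["I12", "I13"],
}

def children(agent):
    return CHILDREN.get(agent, [])        # servers (leaves) have no children

def mediators(agent):
    # The parent of an agent, or None for a root node.
    return next((p for p, kids in CHILDREN.items() if agent in kids), None)

def cluster(agent):
    # The servers (leaves) reachable from `agent`; a server's cluster is itself.
    kids = children(agent)
    return {agent} if not kids else set().union(*(cluster(c) for c in kids))

def members(agent):
    # All agents reachable from `agent`, including itself.
    return {agent}.union(*(members(c) for c in children(agent))) if children(agent) else {agent}

assert children("M6") == ["I12", "I13"]
assert mediators("M4") == "M3"
assert cluster("M6") == {"I12", "I13"}
assert members("M3") == {"M3", "M4", "M5", "M6", "I7", "I8", "I9", "I10", "I11", "I12", "I13"}
```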

Figure 2.2 shows how a group of mediator agents, organized in a hierarchy, can cooperate to allocate a set of servers (a coalition) to undertake an incoming task. A task T is discovered by agent M0. Agent M0 chooses child M3, using its local policy, to form a sub-coalition, and decomposes task T accordingly into subtask T0 = ⟨utility = 100, requestedResource1 = 50, requestedResource2 = 150⟩. Then agent M3 chooses its child M5 to form a sub-coalition for T0, and decomposes T0 into subtask T5 = ⟨u5 = 50, rr_{5,1} = 0, rr_{5,2} = 100⟩. M5 returns a committed coalition C_{T5} = {I10, I11}. The process continues until the whole task T is allocated. Finally, M0 integrates all sub-coalitions into C_T and sends it back to the originator of T.

2.2.1 Architecture

Figure 2.3 illustrates a block diagram of a mediator's architecture in the system. There is a handler for every child. Each handler includes a neural net, which approximates the value of choosing the corresponding child to form a sub-coalition (for the task at hand). The weights of a neural net are optimized using reinforcement learning, as will be described in the following section. To speed up learning using neural nets, the state encoder encodes the current state differently for each child, depending on the amount of resources available at that child. I will come back to that later in Section 2.2.5.

The state aggregator aggregates a mediator's state before it is sent to other mediators up in the hierarchy. When a mediator mh receives an aggregated state from its child m, mediator mh will store the aggregated state in the abstract state field of child m's handler. The current state of a mediator is a combination of the abstract states of its children and the current status of the task at hand (i.e., the resources the task requires that are not yet allocated, and the utility to be gained if the remainder of the task is completed). The task decomposer stores an arriving task. When the local policy chooses a child to form a sub-coalition, the task decomposer decomposes the stored task for this child (storing it in the child's handler). More details of the operation of a mediator follow in the next section.

Figure 2.2 An example of how an organization solves CDTAP (M0 discovers task T and sends subtask T0 to M3; M3 decomposes T0 into subtask T1 and asks M5 to achieve it; M5 achieves T1 and sends results R1 back to M3, including an update to M5's current state; M3 decomposes the rest of the task into subtask T2 and asks another agent to achieve it; the process continues until all of T is achieved, and M0 integrates all results into R and sends it to the environment)
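As a sketch only (the class names, the handler fields, and the value-network interface are my assumptions, not the dissertation's code), a mediator along the lines described above might be organized as follows.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass
class ChildHandler:
    """Per-child bookkeeping inside a mediator (one handler per child)."""
    child_id: str
    value_net: Callable[[list], float]       # approximates the value of choosing this child
    abstract_state: Optional[dict] = None    # last aggregated state received from this child
    pending_subtask: Optional[dict] = None   # subtask decomposed for this child

class MediatorAgent:
    def __init__(self, handlers: List[ChildHandler], encode_state, decompose, aggregate):
        self.handlers: Dict[str, ChildHandler] = {h.child_id: h for h in handlers}
        self.encode_state = encode_state   # state encoder: per-child encoding of the current state
        self.decompose = decompose         # task decomposer: splits the stored task for a chosen child
        self.aggregate = aggregate         # state aggregator: abstracts this mediator's state for its parent
        self.current_task = None

    def receive_abstract_state(self, child_id, abstract_state):
        self.handlers[child_id].abstract_state = abstract_state

    def choose_child(self) -> str:
        # Greedy local policy: pick the child whose value net scores the child-specific
        # state encoding highest (exploration and learning updates omitted for brevity).
        scores = {cid: h.value_net(self.encode_state(self.current_task, h))
                  for cid, h in self.handlers.items()}
        best = max(scores, key=scores.get)
        self.handlers[best].pending_subtask = self.decompose(self.current_task, best)
        return best
```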


Figure 2.3 A mediator architecture.

2.2.2 Local Decision

Algorithm 1, on page 19, describes the decision process used by mediator A in the organization once it receives a task T_A. Figure 2.4 illustrates the algorithm. The algorithm works as follows.

Figure 2.4 The recursive decision process of a mediator
