Michel Raynal

Distributed Algorithms for Message-Passing Systems
Institut Universitaire de France
Springer Heidelberg New York Dordrecht London
Library of Congress Control Number: 2013942973
ACM Computing Classification (1998): F.1, D.1, B.3
© Springer-Verlag Berlin Heidelberg 2013
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
The profusion of things hid the scarcity of ideas and the wearing out of beliefs.
[…] To retain something of the time in which we shall no longer be.
In Les années (2008), Annie Ernaux
Midway upon the journey of our life
I found myself within a forest dark,
For the straightforward pathway had been lost.
In La divina commedia (1307–1321), Dante Alighieri (1265–1321)
We must not want to be anything, but to become everything.
Johann Wolfgang von Goethe (1749–1832)
Each generation doubtless feels called upon to reform the world.
Mine knows, however, that it will not reform it. But its task is perhaps even greater.
It consists in preventing the world from destroying itself.
Speech at the Nobel Banquet, Stockholm, December 10, 1957, Albert Camus (1913–1960)
Nothing is as precarious as living,
Nothing as being is so fleeting,
It is a little like melting for the frost,
Or for the wind being light.
I arrive where I am a stranger.
In Le voyage de Hollande (1965), Louis Aragon (1897–1982)
What Is Distributed Computing? Distributed computing was born in the late 1970s when researchers and practitioners started taking into account the intrinsic characteristics of physically distributed systems. The field then emerged as a specialized research area distinct from networking, operating systems, and parallel computing.
Distributed computing arises when one has to solve a problem in terms of distributed entities (usually called processors, nodes, processes, actors, agents, sensors, peers, etc.) such that each entity has only a partial knowledge of the many parameters involved in the problem that has to be solved. While parallel computing and real-time computing can be characterized, respectively, by the terms efficiency and on-time computing, distributed computing can be characterized by the term uncertainty. This uncertainty is created by asynchrony, multiplicity of control flows, absence of shared memory and global time, failure, dynamicity, mobility, etc. Mastering one form or another of uncertainty is pervasive in all distributed computing problems. A main difficulty in designing distributed algorithms comes from the fact that each entity cooperating in the achievement of a common goal cannot have instantaneous knowledge of the current state of the other entities; it can only know their past local states.
Although distributed algorithms are often made up of a few lines, their behavior can be difficult to understand and their properties hard to state and prove. Hence, distributed computing is not only a fundamental topic but also a challenging topic where simplicity, elegance, and beauty are first-class citizens.
Why This Book? While there are a lot of books on sequential computing (both on basic data structures and on algorithms), this is not the case in distributed computing. Most books on distributed computing consider advanced topics where the uncertainty inherent to distributed computing is created by the net effect of asynchrony and failures. It follows that these books are more appropriate for graduate students than for undergraduate students.
The aim of this book is to present in a comprehensive way basic notions, concepts, and algorithms of distributed computing when the distributed entities cooperate by sending and receiving messages on top of an underlying network. In this case, the main difficulty comes from the physical distribution of the entities and the asynchrony of the environment in which they evolve.
Audience This book has been written primarily for people who are not familiar with the topic and the concepts that are presented. These include mainly:
• Senior-level undergraduate students and graduate students in computer science or computer engineering, who are interested in the principles and foundations of distributed computing.
• Practitioners and engineers who want to be aware of the state-of-the-art concepts, basic principles, mechanisms, and techniques encountered in distributed computing.
Prerequisites for this book include undergraduate courses on algorithms and basic knowledge of operating systems. Selections of chapters for undergraduate and graduate courses are suggested in the section titled "How to Use This Book" in the Afterword.
Content As already indicated, this book covers algorithms, basic principles, and foundations of message-passing programming, i.e., programs where the entities communicate by sending and receiving messages through a network. The world is distributed, and the algorithmic thinking suited to distributed applications and systems is not reducible to sequential computing. Knowledge of the bases of distributed computing is becoming more important than ever as more and more computer applications are now distributed. The book is composed of six parts.
• The aim of the first part, which is made up of five chapters, is to give a feel for the nature of distributed algorithms, i.e., what makes them different from sequential or parallel algorithms. To that end, it mainly considers distributed graph algorithms. In this context, each node of the graph is a process, which has to compute a result whose meaning depends on the whole graph.
Basic distributed algorithms such as network traversals, shortest-path algorithms, vertex coloring, knot detection, etc., are first presented. Then, a general framework for distributed graph algorithms is introduced. A chapter is devoted to leader election algorithms on a ring network, and another chapter focuses on the navigation of a network by mobile objects.
• The second part is on the nature of distributed executions. It is made up of four chapters. In some sense, this part is the core of the book. It explains what a distributed execution is, the fundamental notion of a consistent global state, and the impossibility (without freezing the computation) of knowing whether a computed consistent global state has been passed through by the execution or not. Then, this part of the book addresses an important issue of distributed computations, namely the notion of logical time: scalar (linear) time, vector time, and matrix time. Each type of time is analyzed and examples of their uses are given.
A chapter, which extends the notion of a global state, is then devoted to asynchronous distributed checkpointing. Finally, the last chapter of this part shows how to simulate a synchronous system on top of an asynchronous system (such simulators are called synchronizers).
• The third part of the book is made up of two chapters devoted to distributed mutual exclusion and distributed resource allocation. Different families of permission-based mutual exclusion algorithms are presented. The notion of an adaptive algorithm is also introduced. The notion of a critical section with multiple entries, and the case of resources with a single or several instances, are also presented. Associated deadlock prevention techniques are introduced.
• The fourth part of the book is on the definition and the implementation of communication operations whose abstraction level is higher than the simple send/receive of messages. These communication abstractions impose order constraints on message deliveries. Causal message delivery and total order broadcast are first presented in one chapter. Then, another chapter considers synchronous communication (also called rendezvous or logically instantaneous communication).
• The fifth part of the book, which is made up of two chapters, is on the detection of stable properties encountered in distributed computing. A stable property is a property that, once true, remains true forever. The properties which are studied are the detection of the termination of a distributed computation, and the detection of distributed deadlock. This part of the book is strongly related to the second part (which is devoted to the notion of a global state).
• The sixth and last part of the book, which is also made up of two chapters, is devoted to the notion of a distributed shared memory. The aim is here to provide the entities (processes) with a set of objects that allow them to cooperate at an abstraction level more appropriate than the use of messages. Two consistency conditions, which can be associated with these objects, are presented and investigated, namely, atomicity (also called linearizability) and sequential consistency. Several algorithms implementing these consistency conditions are described.
To have a more complete feeling of the spirit of this book, the reader is invited to consult the section "The Aim of This Book" in the Afterword, which describes what it is hoped has been learned from this book. Each chapter starts with a short presentation and a list of the main keywords, and terminates with a summary of its content. Each of the six parts of the book is also introduced by a brief description of its aim and its technical content.
Acknowledgments This book originates from lecture notes for undergraduate and graduate courses on distributed computing that I give at the University of Rennes (France) and, as an invited professor, at several universities all over the world. I would like to thank the students for their questions that, in one way or another, have contributed to this book. I also want to thank Ronan Nugent (Springer) for his support and his help in putting it all together. Last but not least (and maybe most importantly), I also want to thank all the researchers whose results are presented in this book. Without their work, this book would not exist.
Michel Raynal
Professeur des Universités
Institut Universitaire de France
IRISA-ISTIC, Université de Rennes 1
Campus de Beaulieu, 35042 Rennes, France
March–October 2012
Rennes, Saint-Grégoire, Tokyo, Fukuoka (AINA'12), Arequipa (LATIN'12), Reykjavik (SIROCCO'12), Palermo (CISIS'12), Madeira (PODC'12), Lisbon, Douelle, Saint-Philibert, Rhodes Island (Europar'12), Salvador de Bahia (DISC'12), Mexico City (Turing Year at UNAM)
Contents

Part I Distributed Graph Algorithms
1 Basic Definitions and Network Traversal Algorithms 3
1.1 Distributed Algorithms 3
1.1.1 Definition 3
1.1.2 An Introductory Example: Learning the Communication Graph 6
1.2 Parallel Traversal: Broadcast and Convergecast 9
1.2.1 Broadcast and Convergecast 9
1.2.2 A Flooding Algorithm 10
1.2.3 Broadcast/Convergecast Based on a Rooted Spanning Tree 10
1.2.4 Building a Spanning Tree 12
1.3 Breadth-First Spanning Tree 16
1.3.1 Breadth-First Spanning Tree Built Without Centralized Control 17
1.3.2 Breadth-First Spanning Tree Built with Centralized Control 20
1.4 Depth-First Traversal 24
1.4.1 A Simple Algorithm 24
1.4.2 Application: Construction of a Logical Ring 27
1.5 Summary 32
1.6 Bibliographic Notes 32
1.7 Exercises and Problems 33
2 Distributed Graph Algorithms 35
2.1 Distributed Shortest Path Algorithms 35
2.1.1 A Distributed Adaptation of Bellman–Ford’s Shortest Path Algorithm 35
2.1.2 A Distributed Adaptation of Floyd–Warshall’s Shortest Paths Algorithm 38
2.2 Vertex Coloring and Maximal Independent Set 42
2.2.1 On Sequential Vertex Coloring 42
2.2.2 Distributed (Δ + 1)-Coloring of Processes 43
2.2.3 Computing a Maximal Independent Set 46
2.3 Knot and Cycle Detection 50
2.3.1 Directed Graph, Knot, and Cycle 50
2.3.2 Communication Graph, Logical Directed Graph, and Reachability 51
2.3.3 Specification of the Knot Detection Problem 51
2.3.4 Principle of the Knot/Cycle Detection Algorithm 52
2.3.5 Local Variables 53
2.3.6 Behavior of a Process 54
2.4 Summary 57
2.5 Bibliographic Notes 58
2.6 Exercises and Problems 58
3 An Algorithmic Framework to Compute Global Functions on a Process Graph 59
3.1 Distributed Computation of Global Functions 59
3.1.1 Type of Global Functions 59
3.1.2 Constraints on the Computation 60
3.2 An Algorithmic Framework 61
3.2.1 A Round-Based Framework 61
3.2.2 When the Diameter Is Not Known 64
3.3 Distributed Determination of Cut Vertices 66
3.3.1 Cut Vertices 66
3.3.2 An Algorithm Determining Cut Vertices 67
3.4 Improving the Framework 69
3.4.1 Two Types of Filtering 69
3.4.2 An Improved Algorithm 70
3.5 The Case of Regular Communication Graphs 72
3.5.1 Tradeoff Between Graph Topology and Number of Rounds 72
3.5.2 De Bruijn Graphs 73
3.6 Summary 75
3.7 Bibliographic Notes 76
3.8 Problem 76
4 Leader Election Algorithms 77
4.1 The Leader Election Problem 77
4.1.1 Problem Definition 77
4.1.2 Anonymous Systems: An Impossibility Result 78
4.1.3 Basic Assumptions and Principles of the Election Algorithms 79
4.2 A Simple O(n²) Leader Election Algorithm for Unidirectional Rings 79
4.2.1 Context and Principle 79
4.2.2 The Algorithm 80
4.2.3 Time Cost of the Algorithm 80
4.2.4 Message Cost of the Algorithm 81
4.2.5 A Simple Variant 82
4.3 An O(n log n) Leader Election Algorithm for Bidirectional Rings 83
4.3.1 Context and Principle 83
4.3.2 The Algorithm 84
4.3.3 Time and Message Complexities 85
4.4 An O(n log n) Election Algorithm for Unidirectional Rings 86
4.4.1 Context and Principles 86
4.4.2 The Algorithm 88
4.4.3 Discussion: Message Complexity and FIFO Channels 89
4.5 Two Particular Cases 89
4.6 Summary 90
4.7 Bibliographic Notes 90
4.8 Exercises and Problems 91
5 Mobile Objects Navigating a Network 93
5.1 Mobile Object in a Process Graph 93
5.1.1 Problem Definition 93
5.1.2 Mobile Object Versus Mutual Exclusion 94
5.1.3 A Centralized (Home-Based) Algorithm 94
5.1.4 The Algorithms Presented in This Chapter 95
5.2 A Navigation Algorithm for a Complete Network 96
5.2.1 Underlying Principles 96
5.2.2 The Algorithm 97
5.3 A Navigation Algorithm Based on a Spanning Tree 100
5.3.1 Principles of the Algorithm: Tree Invariant and Proxy Behavior 101
5.3.2 The Algorithm 102
5.3.3 Discussion and Properties 104
5.3.4 Proof of the Algorithm 106
5.4 An Adaptive Navigation Algorithm 108
5.4.1 The Adaptivity Property 109
5.4.2 Principle of the Implementation 109
5.4.3 An Adaptive Algorithm Based on a Distributed Queue 111
5.4.4 Properties 113
5.4.5 Example of an Execution 114
5.5 Summary 115
5.6 Bibliographic Notes 115
5.7 Exercises and Problems 116
Part II Logical Time and Global States in Distributed Systems

6 Nature of Distributed Computations and the Concept of a Global State 121
6.1 A Distributed Execution Is a Partial Order on Local Events 122
6.1.1 Basic Definitions 122
6.1.2 A Distributed Execution Is a Partial Order on Local Events 122
6.1.3 Causal Past, Causal Future, Concurrency, Cut 123
6.1.4 Asynchronous Distributed Execution with Respect to Physical Time 125
6.2 A Distributed Execution Is a Partial Order on Local States 127
6.3 Global State and Lattice of Global States 129
6.3.1 The Concept of a Global State 129
6.3.2 Lattice of Global States 129
6.3.3 Sequential Observations 131
6.4 Global States Including Process States and Channel States 132
6.4.1 Global State Including Channel States 132
6.4.2 Consistent Global State Including Channel States 133
6.4.3 Consistent Global State Versus Consistent Cut 134
6.5 On-the-Fly Computation of Global States 135
6.5.1 Global State Computation Is an Observation Problem 135
6.5.2 Problem Definition 136
6.5.3 On the Meaning of the Computed Global State 136
6.5.4 Principles of Algorithms Computing a Global State 137
6.6 A Global State Algorithm Suited to FIFO Channels 138
6.6.1 Principle of the Algorithm 138
6.6.2 The Algorithm 140
6.6.3 Example of an Execution 141
6.7 A Global State Algorithm Suited to Non-FIFO Channels 143
6.7.1 The Algorithm and Its Principles 144
6.7.2 How to Compute the State of the Channels 144
6.8 Summary 146
6.9 Bibliographic Notes 146
6.10 Exercises and Problems 147
7 Logical Time in Asynchronous Distributed Systems 149
7.1 Linear Time 149
7.1.1 Scalar (or Linear) Time 150
7.1.2 From Partial Order to Total Order: The Notion of a Timestamp 151
7.1.3 Relating Logical Time and Timestamps with Observations 152
7.1.4 Timestamps in Action: Total Order Broadcast 153
7.2 Vector Time 159
7.2.1 Vector Time and Vector Clocks 159
7.2.2 Vector Clock Properties 162
7.2.3 On the Development of Vector Time 163
7.2.4 Relating Vector Time and Global States 165
7.2.5 Vector Clocks in Action: On-the-Fly Determination of a Global State Property 166
7.2.6 Vector Clocks in Action: On-the-Fly Determination of the Immediate Predecessors 170
7.3 On the Size of Vector Clocks 173
7.3.1 A Lower Bound on the Size of Vector Clocks 174
7.3.2 An Efficient Implementation of Vector Clocks 176
7.3.3 k-Restricted Vector Clock 181
7.4 Matrix Time 182
7.4.1 Matrix Clock: Definition and Algorithm 182
7.4.2 A Variant of Matrix Time in Action: Discard Old Data 184
7.5 Summary 186
7.6 Bibliographic Notes 186
7.7 Exercises and Problems 187
8 Asynchronous Distributed Checkpointing 189
8.1 Definitions and Main Theorem 189
8.1.1 Local and Global Checkpoints 189
8.1.2 Z-Dependency, Zigzag Paths, and Z-Cycles 190
8.1.3 The Main Theorem 192
8.2 Consistent Checkpointing Abstractions 196
8.2.1 Z-Cycle-Freedom 196
8.2.2 Rollback-Dependency Trackability 197
8.2.3 On Distributed Checkpointing Algorithms 198
8.3 Checkpointing Algorithms Ensuring Z-Cycle Prevention 199
8.3.1 An Operational Characterization of Z-Cycle-Freedom 199
8.3.2 A Property of a Particular Dating System 199
8.3.3 Two Simple Algorithms Ensuring Z-Cycle Prevention 201
8.3.4 On the Notion of an Optimal Algorithm for Z-Cycle Prevention 203
8.4 Checkpointing Algorithms Ensuring Rollback-Dependency Trackability 203
8.4.1 Rollback-Dependency Trackability (RDT) 203
8.4.2 A Simple Brute Force RDT Checkpointing Algorithm 205
8.4.3 The Fixed Dependency After Send (FDAS) RDT Checkpointing Algorithm 206
8.4.4 Still Reducing the Number of Forced Local Checkpoints 207
8.5 Message Logging for Uncoordinated Checkpointing 211
8.5.1 Uncoordinated Checkpointing 211
8.5.2 To Log or Not to Log Messages on Stable Storage 211
8.5.3 A Recovery Algorithm 214
8.5.4 A Few Improvements 215
8.6 Summary 216
8.7 Bibliographic Notes 216
8.8 Exercises and Problems 217
9 Simulating Synchrony on Top of Asynchronous Systems 219
9.1 Synchronous Systems, Asynchronous Systems, and Synchronizers 219
9.1.1 Synchronous Systems 219
9.1.2 Asynchronous Systems and Synchronizers 221
9.1.3 On the Efficiency Side 222
9.2 Basic Principle for a Synchronizer 223
9.2.1 The Main Problem to Solve 223
9.2.2 Principle of the Solutions 224
9.3 Basic Synchronizers: α and β 224
9.3.1 Synchronizer α 224
9.3.2 Synchronizer β 227
9.4 Advanced Synchronizers: γ and δ 230
9.4.1 Synchronizer γ 230
9.4.2 Synchronizer δ 234
9.5 The Case of Networks with Bounded Delays 236
9.5.1 Context and Hypotheses 236
9.5.2 The Problem to Solve 237
9.5.3 Synchronizer λ 238
9.5.4 Synchronizer μ 239
9.5.5 When the Local Physical Clocks Drift 240
9.6 Summary 242
9.7 Bibliographic Notes 243
9.8 Exercises and Problems 244
Part III Mutual Exclusion and Resource Allocation

10 Permission-Based Mutual Exclusion Algorithms 247
10.1 The Mutual Exclusion Problem 247
10.1.1 Definition 247
10.1.2 Classes of Distributed Mutex Algorithms 248
10.2 A Simple Algorithm Based on Individual Permissions 249
10.2.1 Principle of the Algorithm 249
10.2.2 The Algorithm 251
10.2.3 Proof of the Algorithm 252
10.2.4 From Simple Mutex to Mutex on Classes of Operations 255
10.3 Adaptive Mutex Algorithms Based on Individual Permissions 256
10.3.1 The Notion of an Adaptive Algorithm 256
10.3.2 A Timestamp-Based Adaptive Algorithm 257
10.3.3 A Bounded Adaptive Algorithm 259
10.3.4 Proof of the Bounded Adaptive Mutex Algorithm 262
10.4 An Algorithm Based on Arbiter Permissions 264
10.4.1 Permissions Managed by Arbiters 264
10.4.2 Permissions Versus Quorums 265
10.4.3 Quorum Construction 266
10.4.4 An Adaptive Mutex Algorithm Based on Arbiter Permissions 268
10.5 Summary 273
10.6 Bibliographic Notes 273
10.7 Exercises and Problems 274
11 Distributed Resource Allocation 277
11.1 A Single Resource with Several Instances 277
11.1.1 The k-out-of-M Problem 277
11.1.2 Mutual Exclusion with Multiple Entries: The 1-out-of-M Mutex Problem 278
11.1.3 An Algorithm for the k-out-of-M Mutex Problem 280
11.1.4 Proof of the Algorithm 283
11.1.5 From Mutex Algorithms to k-out-of-M Algorithms 285
11.2 Several Resources with a Single Instance 285
11.2.1 Several Resources with a Single Instance 286
11.2.2 Incremental Requests for Single Instance Resources: Using a Total Order 287
11.2.3 Incremental Requests for Single Instance Resources: Reducing Process Waiting Chains 290
11.2.4 Simultaneous Requests for Single Instance Resources and Static Sessions 292
11.2.5 Simultaneous Requests for Single Instance Resources and Dynamic Sessions 293
11.3 Several Resources with Multiple Instances 295
11.4 Summary 297
11.5 Bibliographic Notes 298
11.6 Exercises and Problems 299
Part IV High-Level Communication Abstractions

12 Order Constraints on Message Delivery 303
12.1 The Causal Message Delivery Abstraction 303
12.1.1 Definition of Causal Message Delivery 304
12.1.2 A Causality-Based Characterization of Causal Message Delivery 305
12.1.3 Causal Order with Respect to Other Message Ordering Constraints 306
12.2 A Basic Algorithm for Point-to-Point Causal Message Delivery 306
12.2.1 A Simple Algorithm 306
12.2.2 Proof of the Algorithm 309
12.2.3 Reduce the Size of Control Information Carried by Messages 310
12.3 Causal Broadcast 313
12.3.1 Definition and a Simple Algorithm 313
12.3.2 The Notion of a Causal Barrier 315
12.3.3 Causal Broadcast with Bounded Lifetime Messages 317
12.4 The Total Order Broadcast Abstraction 320
12.4.1 Strong Total Order Versus Weak Total Order 320
12.4.2 An Algorithm Based on a Coordinator Process or a Circulating Token 322
12.4.3 An Inquiry-Based Algorithm 324
12.4.4 An Algorithm for Synchronous Systems 326
12.5 Playing with a Single Channel 328
12.5.1 Four Order Properties on a Channel 328
12.5.2 A General Algorithm Implementing These Properties 329
12.6 Summary 332
12.7 Bibliographic Notes 332
12.8 Exercises and Problems 333
13 Rendezvous (Synchronous) Communication 335
13.1 The Synchronous Communication Abstraction 335
13.1.1 Definition 335
13.1.2 An Example of Use 337
13.1.3 A Message Pattern-Based Characterization 338
13.1.4 Types of Algorithms Implementing Synchronous Communications 341
13.2 Algorithms for Nondeterministic Planned Interactions 341
13.2.1 Deterministic and Nondeterministic Communication Contexts 341
13.2.2 An Asymmetric (Static) Client–Server Implementation 342
13.2.3 An Asymmetric Token-Based Implementation 345
13.3 An Algorithm for Nondeterministic Forced Interactions 350
13.3.1 Nondeterministic Forced Interactions 350
13.3.2 A Simple Algorithm 350
13.3.3 Proof of the Algorithm 352
13.4 Rendezvous with Deadlines in Synchronous Systems 354
13.4.1 Synchronous Systems and Rendezvous with Deadline 354
13.4.2 Rendezvous with Deadline Between Two Processes 355
13.4.3 Introducing Nondeterministic Choice 358
13.4.4 n-Way Rendezvous with Deadline 360
13.5 Summary 361
13.6 Bibliographic Notes 361
13.7 Exercises and Problems 362
Part V Detection of Properties on Distributed Executions

14 Distributed Termination Detection 367
14.1 The Distributed Termination Detection Problem 367
14.1.1 Process and Channel States 367
14.1.2 Termination Predicate 368
14.1.3 The Termination Detection Problem 369
14.1.4 Types and Structure of Termination Detection Algorithms 369
14.2 Termination Detection in the Asynchronous Atomic Model 370
14.2.1 The Atomic Model 370
14.2.2 The Four-Counter Algorithm 371
14.2.3 The Counting Vector Algorithm 373
14.2.4 The Four-Counter Algorithm vs the Counting Vector Algorithm 376
14.3 Termination Detection in Diffusing Computations 376
14.3.1 The Notion of a Diffusing Computation 376
14.3.2 A Detection Algorithm Suited to Diffusing Computations 377
14.4 A General Termination Detection Algorithm 378
14.4.1 Wave and Sequence of Waves 379
14.4.2 A Reasoned Construction 381
14.5 Termination Detection in a Very General Distributed Model 385
14.5.1 Model and Nondeterministic Atomic Receive Statement 385
14.5.2 The Predicate fulfilled() 387
14.5.3 Static vs Dynamic Termination: Definition 388
14.5.4 Detection of Static Termination 390
14.5.5 Detection of Dynamic Termination 393
14.6 Summary 396
14.7 Bibliographic Notes 396
14.8 Exercises and Problems 397
15 Distributed Deadlock Detection 401
15.1 The Deadlock Detection Problem 401
15.1.1 Wait-For Graph (WFG) 401
15.1.2 AND and OR Models Associated with Deadlock 403
15.1.3 Deadlock in the AND Model 403
15.1.4 Deadlock in the OR Model 404
15.1.5 The Deadlock Detection Problem 404
15.1.6 Structure of Deadlock Detection Algorithms 405
15.2 Deadlock Detection in the One-at-a-Time Model 405
15.2.1 Principle and Local Variables 406
15.2.2 A Detection Algorithm 406
15.2.3 Proof of the Algorithm 407
15.3 Deadlock Detection in the AND Communication Model 408
15.3.1 Model and Principle of the Algorithm 409
15.3.2 A Detection Algorithm 409
15.3.3 Proof of the Algorithm 411
15.4 Deadlock Detection in the OR Communication Model 413
15.4.1 Principle 413
15.4.2 A Detection Algorithm 416
15.4.3 Proof of the Algorithm 419
15.5 Summary 421
15.6 Bibliographic Notes 421
15.7 Exercises and Problems 422
Part VI Distributed Shared Memory
16 Atomic Consistency (Linearizability) 427
16.1 The Concept of a Distributed Shared Memory 427
16.2 The Atomicity Consistency Condition 429
16.2.1 What Is the Issue? 429
16.2.2 An Execution Is a Partial Order on Operations 429
16.2.3 Atomicity: Formal Definition 430
16.3 Atomic Objects Compose for Free 432
16.4 Message-Passing Implementations of Atomicity 435
16.4.1 Atomicity Based on a Total Order Broadcast Abstraction 435
16.4.2 Atomicity of Read/Write Objects Based on Server Processes 437
16.4.3 Atomicity Based on a Server Process and Copy Invalidation 438
16.4.4 Introducing the Notion of an Owner Process 439
16.4.5 Atomicity Based on a Server Process and Copy Update 443
16.5 Summary 444
16.6 Bibliographic Notes 444
16.7 Exercises and Problems 445
17 Sequential Consistency 447
17.1 Sequential Consistency 447
17.1.1 Definition 447
17.1.2 Sequential Consistency Is Not a Local Property 449
17.1.3 Partial Order for Sequential Consistency 450
17.1.4 Two Theorems for Sequentially Consistent Read/Write Registers 451
17.1.5 From Theorems to Algorithms 453
17.2 Sequential Consistency from Total Order Broadcast 453
17.2.1 A Fast Read Algorithm for Read/Write Objects 453
17.2.2 A Fast Write Algorithm for Read/Write Objects 455
17.2.3 A Fast Enqueue Algorithm for Queue Objects 456
17.3 Sequential Consistency from a Single Server 456
17.3.1 The Single Server Is a Process 456
17.3.2 The Single Server Is a Navigating Token 459
17.4 Sequential Consistency with a Server per Object 460
17.4.1 Structural View 460
17.4.2 The Object Managers Must Cooperate 461
17.4.3 An Algorithm Based on the OO Constraint 462
17.5 A Weaker Consistency Condition: Causal Consistency 464
17.5.1 Definition 464
17.5.2 A Simple Algorithm 466
17.5.3 The Case of a Single Object 467
17.6 A Hierarchy of Consistency Conditions 468
17.7 Summary 468
17.8 Bibliographic Notes 469
17.9 Exercises and Problems 470
Afterword 471
The Aim of This Book 471
Most Important Concepts, Notions, and Mechanisms Presented in This Book 471
How to Use This Book 473
From Failure-Free Systems to Failure-Prone Systems 474
A Series of Books 474
References 477
Index 495
Notation

no-op                    no operation
⟨a, b⟩                   pair with two elements a and b
m_1; …; m_q              sequence of messages
a_i[1..s]                array of size s (local to process p_i)
for each i ∈ {1, …, m}   order irrelevant
for each i from 1 to m   order relevant
return(v)                returns v and terminates the operation invocation
¬(a R b)                 relation R does not include the pair ⟨a, b⟩
List of Figures

Fig. 1.1 Three graph types of particular interest 4
Fig. 1.2 Synchronous execution (left) vs. asynchronous (right) execution 5
Fig. 1.3 Learning the communication graph (code for p_i) 7
Fig. 1.4 A simple flooding algorithm (code for p_i) 10
Fig. 1.5 A rooted spanning tree 11
Fig. 1.6 Tree-based broadcast/convergecast (code for p_i) 11
Fig. 1.7 Construction of a rooted spanning tree (code for p_i) 13
Fig. 1.8 Left: underlying communication graph; right: spanning tree 14
Fig. 1.9 An execution of the algorithm constructing a spanning tree 14
Fig. 1.10 Two different spanning trees built from the same communication graph 16
Fig. 1.11 Construction of a breadth-first spanning tree without centralized control (code for p_i) 18
Fig. 1.12 An execution of the algorithm of Fig. 1.11 19
Fig. 1.13 Successive waves launched by the root process p_a 21
Fig. 1.14 Construction of a breadth-first spanning tree with centralized control (starting code) 22
Fig. 1.15 Construction of a breadth-first spanning tree with centralized control (code for a process p_i) 22
Fig. 1.16 Depth-first traversal of a communication graph (code for p_i) 25
Fig. 1.17 Time and message optimal depth-first traversal (code for p_i) 27
Fig. 1.18 Management of the token at process p_i 29
Fig. 1.19 From a depth-first traversal to a ring (code for p_i) 29
Fig. 1.20 Sense of direction of the ring and computation of routing tables 30
Fig. 1.21 An example of a logical ring construction 31
Fig. 1.22 An anonymous network 34
Fig. 2.1 Bellman–Ford's dynamic programming principle 36
Fig. 2.2 A distributed adaptation of Bellman–Ford's shortest path algorithm (code for p_i) 37
Fig. 2.3 A distributed synchronous shortest path algorithm (code for p_i) 38
Fig. 2.4 Floyd–Warshall's sequential shortest path algorithm 39
Fig. 2.5 The principle that underlies Floyd–Warshall's shortest paths algorithm 39
Fig. 2.7 Sequential (Δ + 1)-coloring of the vertices of a graph 42
Fig. 2.8 Distributed (Δ + 1)-coloring from an initial m-coloring where n ≥ m ≥ Δ + 2 43
Fig. 2.10 Examples of maximal independent sets 46
Fig. 2.11 From m-coloring to a maximal independent set (code for p_i) 47
… independent set (code for p_i) 48
Fig. 2.14 A directed graph with a knot 51
Fig. 2.15 Possible message pattern during a knot detection 53
Fig. 2.16 Asynchronous knot detection (code of p_i) 55
Fig. 2.17 Knot/cycle detection: example 57
Fig. 3.1 … (code for p_i) 63
Fig. 3.2 A diameter-independent generic algorithm (code for p_i) 65
Fig. 3.3 A process graph with three cut vertices 66
Fig. 3.4 Determining cut vertices: principle 67
Fig. 3.5 An algorithm determining the cut vertices (code for p_i) 68
Fig. 3.6 A general algorithm with filtering (code for p_i) 71
… (code for p_i) 75
Fig. 4.1 Chang and Roberts' election algorithm (code for p_i) 80
Fig. 4.2 Worst identity distribution for message complexity 81
Fig. 4.3 A variant of Chang and Roberts' election algorithm (code for p_i) 83
Fig. 4.5 Competitors at the end of round r are at distance greater than 2^r 84
Fig. 4.6 Hirschberg and Sinclair's election algorithm (code for p_i) 85
Fig. 4.7 Neighbor processes on the unidirectional ring 87
Fig. 4.8 From the first to the second round 87
Fig. 4.9 Dolev, Klawe, and Rodeh's election algorithm (code for p_i) 88
Fig. 4.10 Index-based randomized election (code for p_i) 90
Fig. 5.2 Structural view of the navigation algorithm (module at process p_i) 98
Fig. 5.3 A navigation algorithm for a complete network (code for p_i) 99
Fig. 5.5 Navigation tree: initial state 101
Fig. 5.6 Navigation tree: after the object has moved to p_c 102
Fig. 5.7 Navigation tree: proxy role of a process 102
Fig. 5.8 A spanning tree-based navigation algorithm (code for p_i) 104
Fig. 5.9 The case of non-FIFO channels 105
Fig. 5.10 … R = [d(i_1), d(i_2), …, d(i_{x−1}), d(i_x), 0, …, 0] 108
Fig. 5.11 A dynamically evolving spanning tree 110
Fig. 5.12 A navigation algorithm based on a distributed queue (code for p_i) 112
Fig. 5.13 From the worst to the best case 113
Fig. 5.14 Example of an execution 114
Fig. 5.15 A hybrid navigation algorithm (code for p_i) 117
Fig. 6.1 A distributed execution as a partial order 124
Fig. 6.2 Past, future, and concurrency sets associated with an event 125
Fig. 6.3 Cut and consistent cut 126
Fig. 6.4 Two instances of the same execution 126
Fig. 6.5 Consecutive local states of a process p_i 127
Fig. 6.6 From a relation on events to a relation on local states 128
Fig. 6.7 A two-process distributed execution 130
Fig. 6.8 Lattice of consistent global states 130
Fig. 6.9 Sequential observations of a distributed computation 131
Fig. 6.10 Illustrating the notations "e ∈ σ_i" and "f ∈ σ_i" 133
Fig. 6.11 In-transit and orphan messages 133
Fig. 6.12 Cut versus global state 135
Fig. 6.13 Global state computation: structural view 136
Fig. 6.14 Recording of a local state 139
Fig. 6.15 Reception of a MARKER() message: case 1 139
Fig. 6.16 Reception of a MARKER() message: case 2 139
Fig. 6.17 Global state computation (FIFO channels, code for cp_i) 140
Fig. 6.18 A simple automaton for process p_i (i = 1, 2) 141
Fig. 6.19 Prefix of a simple execution 142
Fig. 6.20 … on a distributed execution 142
Fig. 6.21 Consistent cut associated with the computed global state 143
Fig. 6.22 A rubber band transformation 143
Fig. 6.23 Global state computation (non-FIFO channels, code for cp_i) 145
Fig. 6.24 Example of a global state computation (non-FIFO channels) 145
… (non-FIFO channels, code for cp_i) 148
Fig. 7.1 Implementation of a linear clock (code for process p_i) 150
Fig. 7.2 A simple example of a linear clock system 151
Fig. 7.3 A non-sequential observation obtained from linear time 152
Fig. 7.5 Total order broadcast: the problem that has to be solved 155
Fig. 7.6 Structure of the total order broadcast implementation 155
Fig. 7.7 Implementation of total order broadcast (code for process p_i) 157
Fig. 7.8 To_delivery predicate of a message at process p_i 157
Fig. 7.9 Implementation of a vector clock system (code for process p_i) 160
Fig. 7.10 Time propagation in a vector clock system 161
Fig. 7.11 On the development of time (1) 164
Fig. 7.12 On the development of time (2) 164
Fig. 7.13 Associating vector dates with global states 165
Fig. 7.14 First global state satisfying a global predicate (1) 167
Fig. 7.15 First global state satisfying a global predicate (2) 168
Fig. 7.16 Detection of the first global state satisfying ∧_i LP_i (code for process p_i) 169
Fig. 7.17 Relevant events in a distributed computation 171
Fig. 7.18 Vector clock system for relevant events (code for process p_i) 171
Fig. 7.19 From relevant events to Hasse diagram 171
Fig. 7.20 … (code for process p_i) 172
Fig. 7.21 Four possible cases when updating imp_i[k], while vc_i[k] = vc[k] 173
Fig. 7.22 A specific communication pattern 175
Fig. 7.23 Specific communication pattern with n = 3 processes 175
Fig. 7.24 Management of vc_i[1..n] and kprime_i[1..n, 1..n] (code for process p_i): Algorithm 1 178
Fig. 7.25 Management of vc_i[1..n] and kprime_i[1..n, 1..n] (code for process p_i): Algorithm 2 179
Fig. 7.26 An adaptive communication layer (code for process p_i) 181
Fig. 7.27 Implementation of a k-restricted vector clock system (code for process p_i) 182
Fig. 7.28 Matrix time: an example 183
Fig. 7.29 Implementation of matrix time (code for process p_i) 184
Fig. 7.30 Discarding obsolete data: structural view (at a process p_i) 185
Fig. 7.31 A buffer management algorithm (code for process p_i) 185
Fig. 7.32 Yet another clock system (code for process p_i) 188
Fig. 8.2 A zigzag pattern 192
Fig. 8.3 … a zigzag path joining two local checkpoints of LC 194
Fig. 8.4 … a zigzag path joining two local checkpoints 195
Fig. 8.5 Domino effect (in a system of two processes) 196
Fig. 8.6 Proof by contradiction of Theorem 11 200
Fig. 8.7 … (code for p_i) 201
Fig. 8.8 To take or not to take a forced local checkpoint 202
Fig. 8.9 An example of z-cycle prevention 202
Fig. 8.10 A vector clock system for rollback-dependency trackability (code for p_i) 204
Fig. 8.11 Intervals and vector clocks for rollback-dependency trackability 204
Fig. 8.12 Russell's pattern for ensuring the RDT consistency condition 205
Fig. 8.13 Russell's checkpointing algorithm (code for p_i) 205
Fig. 8.14 FDAS checkpointing algorithm (code for p_i) 207
Fig. 8.15 Matrix causal_i[1..n, 1..n] 208
Fig. 8.16 Pure (left) vs. impure (right) causal paths from p_j to p_i 208
Fig. 8.17 An impure causal path from p_i to itself 209
Fig. 8.18 An efficient checkpointing algorithm for RDT (code for p_i) 210
Fig. 8.19 Sender-based optimistic message logging 212
Fig. 8.20 To log or not to log a message? 212
Fig. 8.21 An uncoordinated checkpointing algorithm (code for p_i) 214
Fig. 8.22 Retrieving the messages which are in transit with respect to the pair (c_i, c_j) 215
Fig. 9.2 Synchronous breadth-first traversal algorithm (code for p_i) 221
Fig. 9.3 Synchronizer: from asynchrony to logical synchrony 222
Fig. 9.4 Synchronizer α (code for p_i) 226
Fig. 9.5 Synchronizer α: possible message arrival at process p_i 227
Fig. 9.6 Synchronizer β (code for p_i) 229
Fig. 9.7 … (but not with α): case 1 229
Fig. 9.8 … (but not with α): case 2 229
Fig. 9.9 Synchronizer γ: a communication graph 230
Fig. 9.10 Synchronizer γ: a partitioning 231
Fig. 9.11 Synchronizer γ (code for p_i) 233
Fig. 9.12 Synchronizer δ (code for p_i) 235
Fig. 9.13 Initialization of physical clocks (code for p_i) 236
Fig. 9.14 The scenario to be prevented 237
Fig. 9.15 Interval during which a process can receive pulse r messages 238
Fig. 9.16 Synchronizer λ (code for p_i) 239
Fig. 9.17 Synchronizer μ (code for p_i) 240
Fig. 9.18 Clock drift with respect to reference time 241
Fig. 10.1 A mutex invocation pattern and the three states of a process 248
Fig. 10.2 Mutex module at a process p_i: structural view 250
Fig. 10.3 … (code for p_i) 251
Fig. 10.4 Proof of the safety property of the algorithm of Fig. 10.3 253
Fig. 10.5 Proof of the liveness property of the algorithm of Fig. 10.3 254
Fig. 10.6 … (code for p_i) 256
Fig. 10.7 … (code for p_i) 258
Fig. 10.8 Non-FIFO channel in the algorithm of Fig. 10.7 259
Fig. 10.9 States of the message PERMISSION({i, j}) 260
Fig. 10.10 A bounded adaptive algorithm based on individual permissions (code for p_i) 261
Fig. 10.11 Arbiter permission-based mechanism 265
Fig. 10.13 An order two projective plane 267
Fig. 10.14 A safe (but not live) mutex algorithm based on arbiter permissions (code for p_i) 269
Fig. 10.15 Permission preemption to prevent deadlock 270
Fig. 10.16 A mutex algorithm based on arbiter permissions (code for p_i) 272
Fig. 11.1 An algorithm for the multiple entries mutex problem (code for p_i) 279
Fig. 11.2 Sending pattern of NOT_USED() messages: case 1 281
Fig. 11.3 Sending pattern of NOT_USED() messages: case 2 282
Fig. 11.4 An algorithm for the k-out-of-M mutex problem (code for p_i) 282
Fig. 11.5 Examples of conflict graphs 286
Fig. 11.6 Global conflict graph 287
Fig. 11.7 A deadlock scenario involving two processes and two resources 288
Fig. 11.8 No deadlock with ordered resources 289
Fig. 11.9 A particular pattern in using resources 289
Fig. 11.10 Conflict graph for six processes, each resource being shared by two processes 290
Fig. 11.11 Optimal vertex-coloring of a resource graph 291
Fig. 11.12 Conflict graph for static sessions (SS_CG) 292
Fig. 11.13 Simultaneous requests in dynamic sessions (sketch of code for p_i) 296
Fig. 11.14 Algorithms for generalized k-out-of-M (code for p_i) 297
… (code for p_i) 299
Fig. 12.1 The causal message delivery order property 304
Fig. 12.2 The delivery pattern prevented by the empty interval property 305
Fig. 12.3 Structure of a causal message delivery implementation 307
Fig. 12.4 An implementation of causal message delivery (code for p_i) 308
Fig. 12.5 Message pattern for the proof of the causal order delivery 309
Fig. 12.6 An implementation reducing the size of control information (code for p_i) 312
Fig. 12.7 Control information carried by consecutive messages sent by p_j to p_i 312
Fig. 12.8 An adaptive sending procedure for causal message delivery 313
Fig. 12.9 Illustration of causal broadcast 313
Fig. 12.10 A simple algorithm for causal broadcast (code for p_i) 314
Fig. 12.11 The causal broadcast algorithm in action 315
Fig. 12.12 The graph of immediate predecessor messages 316
Fig. 12.13 A causal broadcast algorithm based on causal barriers (code for p_i) 317
Fig. 12.14 Message with bounded lifetime 318
Fig. 12.15 On-time versus too late 319
Fig. 12.16 A Δ-causal broadcast algorithm (code for p_i) 320
Fig. 12.17 Implementation of total order message delivery requires coordination 321
Fig. 12.18 Total order broadcast based on a coordinator process 322
Fig. 12.19 Token-based total order broadcast 323
Fig. 12.20 Clients and servers in total order broadcast 324
Fig. 12.21 A total order algorithm from clients p_i to servers q_j 326
Fig. 12.22 A total order algorithm for synchronous systems 327
… (cannot bypass other messages) 329
Fig. 12.25 Message m with type marker 329
Fig. 12.26 Building a first in first out channel 330
Fig. 12.27 Message delivery according to message types 331
… messages as "points" instead of "intervals" 336
Fig. 13.4 A crown of size k = 2 (left) and a crown of size k = 3 (right) 339
Fig. 13.5 Four message patterns 340
Fig. 13.6 Implementation of a rendezvous when the client is the sender 344
Fig. 13.7 Implementation of a rendezvous when the client is the receiver 345
Fig. 13.8 A token-based mechanism to implement an interaction 346
Fig. 13.9 Deadlock and livelock prevention in interaction implementation 347
Fig. 13.10 A general token-based implementation for planned interactions (rendezvous) 348
Fig. 13.11 An algorithm for forced interactions (rendezvous) 351
Fig. 13.12 Forced interaction: message pattern when i > j 352
Fig. 13.13 Forced interaction: message pattern when i < j 352
Fig. 13.14 … (two-process symmetric algorithm) 356
Fig. 13.15 Real-time rendezvous between two processes p and q 357
Fig. 13.16 … (asymmetric algorithm) 358
Fig. 13.17 Nondeterministic rendezvous with deadline 359
Fig. 13.18 Multirendezvous with deadline 361
Fig. 13.19 Comparing two date patterns for rendezvous with deadline 363
Fig. 14.1 Process states for termination detection 368
Fig. 14.2 Global structure of the observation modules 370
Fig. 14.3 An execution in the asynchronous atomic model 371
Fig. 14.4 One visit is not sufficient 371
Fig. 14.5 The four-counter algorithm for termination detection 372
Fig. 14.6 Two consecutive inquiries 373
Fig. 14.7 The counting vector algorithm for termination detection 374
Fig. 14.8 The counting vector algorithm at work 375
Fig. 14.9 Termination detection of a diffusing computation 378
Fig. 14.10 Ring-based implementation of a wave 380
Fig. 14.11 Spanning tree-based implementation of a wave 381
Fig. 14.12 … (∧_{1≤i≤n} idle_i^x) ⇒ TERM(C, τ_x) is not true 382
Fig. 14.13 A general algorithm for termination detection 384
Fig. 14.14 Atomicity associated with τ_i^x 384
Fig. 14.15 Structure of the channels to p_i 386
Fig. 14.16 An algorithm for static termination detection 391
Fig. 14.17 Definition of time instants for the safety of static termination 392
Fig. 14.18 Cooperation between local observers 394
Fig. 14.19 An algorithm for dynamic termination detection 395
Fig. 14.20 Example of a monotonous distributed computation 398
Fig. 15.1 Examples of wait-for graphs 402
Fig. 15.2 … in the AND communication model 410
Fig. 15.3 Determining in-transit messages 411
Fig. 15.4 … (with no application messages in transit) 411
Fig. 15.5 Time instants in the proof of the safety property 412
Fig. 15.6 A directed communication graph 414
Fig. 15.7 Network traversal with feedback on a static graph 414
Fig. 15.8 Modification in a wait-for graph 415
Fig. 15.9 Inconsistent observation of a dynamic wait-for graph 416
Fig. 15.10 An algorithm for deadlock detection in the OR communication model 418
Fig. 15.11 Activation pattern for the safety proof 420
Fig. 15.12 Another example of a wait-for graph 423
Fig. 16.1 Structure of a distributed shared memory 428
Fig. 16.2 Register: what values can be returned by read operations? 429
Fig. 16.3 The relation →_op of the computation described in Fig. 16.2 430
Fig. 16.4 An execution of an atomic register 432
Fig. 16.5 Another execution of an atomic register 432
Fig. 16.6 Atomicity allows objects to compose for free 435
Fig. 16.7 From total order broadcast to atomicity 436
Fig. 16.8 Why read operations have to be to-broadcast 437
Fig. 16.9 Invalidation-based implementation of atomicity: message flow 438
Fig. 16.10 Invalidation-based implementation of atomicity: algorithm 440
Fig. 16.11 Invalidation and owner-based implementation of atomicity (code of p_i) 441
Fig. 16.12 Invalidation and owner-based implementation of atomicity (code of the manager p_X) 442
Fig. 16.13 Update-based implementation of atomicity 443
Fig. 16.14 Update-based algorithm implementing atomicity 444
Fig. 17.1 A sequentially consistent computation (which is not atomic) 448
Fig. 17.2 A computation which is not sequentially consistent 449
Fig. 17.3 A sequentially consistent queue 449
Fig. 17.4 Sequential consistency is not a local property 450
Fig. 17.5 Part of the graph G used in the proof of Theorem 29 452
Fig. 17.6 Fast read algorithm implementing sequential consistency (code for p_i) 454
Fig. 17.7 Fast write algorithm implementing sequential consistency (code for p_i) 456
Fig. 17.8 Fast enqueue algorithm implementing a sequentially consistent queue (code for p_i) 457
Fig. 17.9 Read/write sequentially consistent registers from a central manager 458
Fig. 17.10 Pattern of read/write accesses used in the proof of Theorem 33 459
Fig. 17.11 Token-based sequentially consistent shared memory (code for p_i) 460
Fig. 17.12 Architectural view associated with the OO constraint 461
Fig. 17.13 Why the object managers must cooperate 461
Fig. 17.14 Sequential consistency with a manager per object: process side 462
Fig. 17.15 Sequential consistency with a manager per object: manager side 463
Fig. 17.17 An example of a causally consistent computation 465
Fig. 17.18 Another example of a causally consistent computation 466
Fig. 17.19 A simple algorithm implementing causal consistency 467
Fig. 17.20 Causal consistency for a single object 467
Fig. 17.21 Hierarchy of consistency conditions 469
Part I
Distributed Graph Algorithms
This first part of the book is on distributed graph algorithms. These algorithms consider the distributed system as a connected graph whose vertices are the processes (nodes) and whose edges are the communication channels. It is made up of five chapters.
After having introduced base definitions, Chap. 1 addresses network traversals. It presents distributed algorithms that realize parallel, depth-first, and breadth-first network traversals. Chapter 2 is on distributed algorithms solving classical graph problems such as shortest paths, vertex coloring, maximal independent set, and knot detection. This chapter shows that the distributed techniques to solve graph problems are not obtained by a simple extension of their sequential counterparts. Chapter 3 presents a general technique to compute a global function on a process graph, each process providing its own input parameter, and obtaining its own output (which depends on the whole set of inputs). Chapter 4 is on the leader election problem, with a strong emphasis on unidirectional and bidirectional rings. Finally, the last chapter of this part, Chap. 5, presents several algorithms that allow a mobile object to navigate a network.
In addition to the presentation of distributed graph algorithms, which can be used in distributed applications, an aim of this part of the book is to allow readers to have a better intuition of the term distributed when comparing distributed algorithms and sequential algorithms.
Chapter 1
Basic Definitions and Network Traversal Algorithms
This chapter first introduces basic definitions related to distributed algorithms. Then, considering a distributed system as a graph whose vertices are the processes and whose edges are the communication channels, it presents distributed algorithms for graph traversals, namely, parallel traversal, breadth-first traversal, and depth-first traversal. It also shows how spanning trees or rings can be constructed from these distributed graph traversal algorithms. These trees and rings can, in turn, be used to easily implement broadcast and convergecast algorithms.
As the reader will see, the distributed graph traversal techniques are different from their sequential counterparts in their underlying principles, behaviors, and complexities. This comes from the fact that, in a distributed context, the same type of traversal can usually be realized in distinct ways, each with its own tradeoff between its time complexity and message complexity.
Keywords Asynchronous/synchronous system · Breadth-first traversal · Broadcast · Convergecast · Depth-first traversal · Distributed algorithm · Forward/discard principle · Initial knowledge · Local algorithm · Parallel traversal · Spanning tree · Unidirectional logical ring
1.1 Distributed Algorithms
1.1.1 Definition
Processes A distributed system is made up of a collection of computing units, each one abstracted through the notion of a process. The processes are assumed to cooperate on a common goal, which means that they exchange information in one way or another.
The set of processes is static. It is composed of n processes and denoted Π = {p_1, …, p_n}, where each p_i, 1 ≤ i ≤ n, represents a distinct process. Each process p_i is sequential, i.e., it executes one step at a time.
The integer i denotes the index of process p_i, i.e., the way an external observer can distinguish processes. It is nearly always assumed that each process p_i has its own identity, denoted id_i; then p_i knows id_i (in a lot of cases, but not always, id_i = i).
Fig. 1.1 Three graph types of particular interest
Communication Medium The processes communicate by sending and receiving messages through channels. Each channel is assumed to be reliable (it does not create, modify, or duplicate messages).
In some cases, we assume that channels are first in first out (FIFO), which means that the messages are received in the order in which they have been sent. Each channel is assumed to be bidirectional (it can carry messages in both directions) and to have an infinite capacity (it can contain any number of messages, each of any size). In some particular cases, we will consider channels which are unidirectional (such channels carry messages in one direction only).
Each process p_i has a set of neighbors, denoted neighbors_i. According to the context, this set contains either the local identities of the channels connecting p_i to its neighbor processes or the identities of these processes.
Structural View It follows from the previous definitions that, from a structural point of view, a distributed system can be represented by a connected undirected graph G = (Π, C) (where C denotes the set of channels). Three types of graph are of particular interest (Fig. 1.1); a small code sketch of these graph types follows the list below.
can communicate directly, a left neighbor and a right neighbor
• A tree is a graph that has two noteworthy properties: it is acyclic and connected
(which means that adding a new channel would create a cycle while suppressing
a channel would disconnect it)
• A fully connected graph is a graph in which each process is directly connected to
every other process (In graph terminology, such a graph is called a clique.)
Distributed Algorithm A distributed algorithm is a collection of n automata, one
per process An automaton describes the sequence of steps executed by the sponding process
corre-In addition to the power of a Turing machine, an automaton is enriched with twocommunication operations which allows it to send a message on a channel or receive
a message on any channel The operations aresend()andreceive()
Synchronous Algorithm A distributed synchronous algorithm is an algorithm
de-signed to be executed on a synchronous distributed system The progress of such asystem is governed by an external global clock, and the processes collectively exe-
cute a sequence of rounds, each round corresponding to a value of the global clock.
Trang 33Fig 1.2 Synchronous execution (left) vs asynchronous (right) execution
During a round, a process sends at most one message to each of its neighbors The
fundamental property of a synchronous system is that a message sent by a process during a round r is received by its destination process during the very same round r Hence, when a process proceeds to the round r+ 1, it has received (and processed)
all the messages which have been sent to it during round r, and it knows that the
same is true for any process
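To make this round structure concrete, here is a minimal Python sketch of a synchronous executor (an illustration under stated assumptions, not the book's code; run_synchronous, init, and step are hypothetical names). All messages sent during round r are delivered before any process starts round r + 1.

def run_synchronous(neighbors, init, step, rounds):
    # neighbors: {pid: set of pids}; init(pid) -> initial local state;
    # step(pid, state, inbox) -> (new_state, {dest_pid: message}), called once per round
    states = {p: init(p) for p in neighbors}
    inboxes = {p: [] for p in neighbors}
    for _ in range(rounds):
        outgoing = {p: [] for p in neighbors}      # messages sent during this round
        for p in neighbors:
            states[p], to_send = step(p, states[p], inboxes[p])
            for dest, msg in to_send.items():      # at most one message per neighbor
                outgoing[dest].append((p, msg))
        inboxes = outgoing     # every round-r message is delivered before round r+1
    return states

Run for as many rounds as the network diameter, such a skeleton suffices, for instance, to flood the smallest identity to every process.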
Space/Time Diagram A distributed execution can be graphically represented by what is called a space/time diagram. Each sequential progress is represented by an arrow from left to right, and a message is represented by an arrow from the sending process to the destination process. These notions will be made more precise in Chap. 6.
The space/time diagram on the left of Fig. 1.2 represents a synchronous execution. The vertical lines are used to separate the successive rounds. During the first round, p_1 sends a message to p_3, and p_2 sends a message to p_1, etc.
Asynchronous Algorithm A distributed asynchronous algorithm is an algorithm designed to be executed on an asynchronous distributed system. In such a system, there is no notion of an external time. That is why asynchronous systems are sometimes called time-free systems.
In an asynchronous algorithm, the progress of a process is ensured by its own computation and the messages it receives. When a process receives a message, it processes the message and, according to its local algorithm, possibly sends messages to its neighbors.
A process processes one message at a time. This means that the processing of a message cannot be interrupted by the arrival of another message. When a message arrives, it is added to the input buffer of the receiving process. It will be processed after all the messages that precede it in this buffer have been processed.
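The following Python sketch (an assumption for illustration; the class and method names are hypothetical) shows this discipline: each process is a thread whose loop extracts one message at a time from its input buffer and hands it to the local algorithm.

import queue
import threading

class AsyncProcess(threading.Thread):
    # A process with an input buffer; messages are handled strictly one at a time.
    def __init__(self, pid, channels):
        super().__init__(daemon=True)
        self.pid = pid
        self.channels = channels                 # shared {pid: queue.Queue()}

    def send(self, dest, msg):                   # the send() operation
        self.channels[dest].put((self.pid, msg))

    def on_receive(self, sender, msg):           # the local algorithm goes here
        pass

    def run(self):                               # time-free receive() loop
        while True:
            sender, msg = self.channels[self.pid].get()   # blocks until a message arrives
            self.on_receive(sender, msg)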
The space/time diagram of a simple asynchronous execution is depicted on the right of Fig. 1.2. One can see that, in this example, the messages from p_1 to p_2 are not received in their sending order. Hence, the channel from p_1 to p_2 is not a FIFO (first in first out) channel. It is easy to see from the figure that a synchronous execution is more structured than an asynchronous execution.
Initial Knowledge of a Process When solving a problem in a synchronous/asynchronous system, a process is characterized by its input parameters (which are related to the problem to solve) and its initial knowledge of its environment.
This knowledge concerns its identity, the total number n of processes, the identity of its neighbors, the structure of the communication graph, etc. As an example, a process p_i may only know that:
• it is on a unidirectional ring,
• it has a left neighbor from which it can receive messages,
• it has a right neighbor to which it can send messages,
• its identity is id_i,
• the fact that no two processes have the same identity, and
• the fact that the set of identities is totally ordered.
As we can see, with such an initial knowledge, no process initially knows the total number of processes n. Learning this number requires the processes to exchange information.
1.1.2 An Introductory Example: Learning the Communication Graph
As a simple example, this section presents an asynchronous algorithm that allows each process to learn the communication graph in which it evolves. It is assumed that the channels are bidirectional and that the communication graph is connected (there is a path from any process to any other process).
Initial Knowledge Each process pi has identity idi, and no process knows n (the total number of processes). Initially, a process pi knows its identity and the identity idj of each of its neighbors. Hence, each process pi is initially provided with a set neighborsi and, for each idj ∈ neighborsi, the pair ⟨idi, idj⟩ locally denotes the channel connecting pi to pj. Let us observe that, as the channels are bidirectional, both ⟨idi, idj⟩ and ⟨idj, idi⟩ denote the same channel and are consequently considered as synonyms.
The Forward/Discard Principle The principle on which the algorithm relies is pretty simple: Each process initially sends its position in the graph to each of its neighbors. This position is represented by the pair (idi, neighborsi).

Then, when a process pi receives a pair (idk, neighborsk) for the first time, it updates its local representation of the communication graph and forwards the message it has received to all its neighbors (except the one that sent it this message). This is the "when new, forward" principle. On the contrary, if it is not the first time that pi receives the pair (idk, neighborsk), it discards it. This is the "when not new, discard" principle.
When pi has received a pair (idk, neighborsk), we say that it "knows the position" of pk in the graph. This means that it knows both the identity idk and the channels connecting pk to its neighbors.
operation start() is
(1) for each idj ∈ neighborsi
(2)    do send POSITION(idi, neighborsi) to the neighbor identified idj
(3) end for;
(4) parti ← true
end operation.

when START() is received do
(5) if (¬parti) then start() end if.

when POSITION(id, neighbors) is received from neighbor identified idx do
(6)  if (¬parti) then start() end if;
(7)  if (id ∉ proc_knowni) then
(8)      proc_knowni ← proc_knowni ∪ {id};
(9)      channels_knowni ← channels_knowni ∪ {⟨id, idk⟩ such that idk ∈ neighbors};
(10)     for each idy ∈ neighborsi \ {idx}
(11)        do send POSITION(id, neighbors) to the neighbor identified idy
(12)     end for;
(13)     if (∀ ⟨idj, idk⟩ ∈ channels_knowni : {idj, idk} ⊆ proc_knowni)
(14)        then pi knows the communication graph; return()
(15)     end if
(16)  end if.

Fig. 1.3 Learning the communication graph (code for pi)
Local Representation of the Communication Graph The graph is locally represented at each process pi with two local variables.

• The local variable proc_knowni is a set that contains all the processes whose position is known by pi. Initially, proc_knowni = {idi}.
• The local variable channels_knowni is a set that contains all the channels known by pi. Initially, channels_knowni = {⟨idi, idj⟩ such that idj ∈ neighborsi}.
Hence, after a process has received a message containing the pair (idj, neighborsj), we have idj ∈ proc_knowni and {⟨idj, idk⟩ such that idk ∈ neighborsj} ⊆ channels_knowni.
In addition to the local representation of the graph, pi has a local Boolean variable parti, initialized to false, which is set to true when pi starts participating in the algorithm.
Internal Versus External Messages The participation of a process starts when it receives an external message START() or an internal message POSITION(). An internal message is a message generated by the algorithm, while an external message is a message coming from outside. External messages are used to launch the algorithm. It is assumed that at least one process receives such a message.
Algorithm: Forward/Discard The algorithm is described in Fig. 1.3. As previously indicated, when a process pi receives a message START() or POSITION(), it starts participating in the algorithm if not yet done (line 5 or 6). To that end it sends the message POSITION(idi, neighborsi) to each of its neighbors (line 2) and sets parti to true (line 4).
When pi receives a message POSITION(id, neighbors) from one of its neighbors px for the first time (line 7), it includes the position of the corresponding process in its local data structures proc_knowni and channels_knowni (lines 8–9) and, as it has learned something new, it forwards this message POSITION() to all its neighbors but the one that sent it this message (line 10). If it has already received the message POSITION(id, neighbors) (we then have id ∈ proc_knowni), pi discards the message.
Algorithm: Termination As the communication graph is connected, it is easy to see that, as soon as a process receives a message START(), each process pi will send a message POSITION(idi, neighborsi) which, from neighbor to neighbor, will be received by each process. Consequently, for any pair of processes (pi, pj), pi will receive a message POSITION(idj, neighborsj), from which it follows that any process pi eventually learns the communication graph.
Moreover, as (a) there is a bounded number of processes n, (b) each process pi is the only process to initiate the sending of the message POSITION(idi, neighborsi), and (c) any process pj forwards this message only once, it follows that there is a finite time after which no more messages are sent. Consequently, the algorithm terminates at each process. While the algorithm always terminates, the important question is the following: When does a process know that it can stop participating in the algorithm? Trivially, a process can stop when it knows that it has learned the whole communication graph (due to the "forward" strategy, when a process knows the whole graph, it also knows that its neighbors eventually know it). This knowledge can be easily captured by a process pi with the help of its local data structures proc_knowni and channels_knowni. More precisely, remembering that the pairs ⟨idi, idj⟩ and ⟨idj, idi⟩ are synonyms and using a classical graph closure property, a process pi knows the whole graph when ∀ ⟨idj, idk⟩ ∈ channels_knowni : {idj, idk} ⊆ proc_knowni. This local termination predicate appears at line 13. When it becomes locally satisfied, a process pi learns that it knows the whole graph and also that its neighbors will eventually know it. That process can consequently stop its execution by invoking the statement return() at line 14.
It is important to notice that the simplicity of the termination predicate comes from an appropriate choice of the local data structures (proc_knowni and channels_knowni) used to represent the communication graph.
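To make the forward/discard mechanics and the termination predicate concrete, here is a minimal Python simulation of Fig. 1.3. It is a sketch under our own assumptions: the class names, the Network scheduler, and the use of frozenset pairs to encode the synonym rule ⟨idi, idj⟩ = ⟨idj, idi⟩ are all illustrative, not from the book.

from collections import deque

class Process:
    def __init__(self, ident, neighbors):
        self.ident = ident
        self.neighbors = set(neighbors)        # identities of the direct neighbors
        self.proc_known = {ident}              # processes whose position is known
        self.channels_known = {frozenset((ident, j)) for j in neighbors}
        self.part = False

    def start(self, net):                      # lines 1-4 of Fig. 1.3
        for j in self.neighbors:
            net.send(self.ident, j, (self.ident, self.neighbors))
        self.part = True

    def on_position(self, net, sender, pos):   # lines 6-16 of Fig. 1.3
        ident, neighbors = pos
        if not self.part:
            self.start(net)
        if ident not in self.proc_known:       # "when new, forward"
            self.proc_known.add(ident)
            self.channels_known |= {frozenset((ident, k)) for k in neighbors}
            for y in self.neighbors - {sender}:
                net.send(self.ident, y, pos)
            # line 13: every known channel has both endpoints in proc_known
            if all(set(c) <= self.proc_known for c in self.channels_known):
                print(f"p{self.ident} knows the whole communication graph")

class Network:
    def __init__(self, procs):
        self.procs = {p.ident: p for p in procs}
        self.queue = deque()                   # POSITION() messages in transit
    def send(self, frm, to, pos):
        self.queue.append((frm, to, pos))
    def run(self, starter):
        self.procs[starter].start(self)        # models the external START()
        while self.queue:
            frm, to, pos = self.queue.popleft()
            self.procs[to].on_position(self, frm, pos)

# A triangle {1, 2, 3} with a pendant process 4 attached to 3.
net = Network([Process(1, {2, 3}), Process(2, {1, 3}),
               Process(3, {1, 2, 4}), Process(4, {3})])
net.run(starter=1)

Each process prints its line exactly when the predicate of line 13 becomes locally true, i.e., when every channel it knows has both of its endpoints in proc_known.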
In the following, let e denote the number of channels and D the diameter of the communication graph. The diameter of a graph is the longest among all the shortest distances connecting any pair of processes, where the shortest distance between pi and pj is the smallest number of channels to go from pi to pj. The diameter is a global notion that measures the "breadth" of the communication graph.
For any i and any channel, a message POSITION(idi, −) is sent at least once and at most twice (once in each direction) on that channel. It follows that the message complexity is upper bounded by 2ne.
As far as the time complexity is concerned, let us consider that each message takes one time unit and local processing has zero duration. In the worst case, a single process pk receives a message START() and there is a process pℓ at distance D from pk. In this case, it takes D time units for a message POSITION(idk, −) to arrive at pℓ. This message wakes up pℓ, and it then takes D time units for a message POSITION(idℓ, −) to arrive at pk. It follows that the time complexity is upper bounded by 2D.
Finally, let d denote the maximal degree of the communication graph, i.e., d = max({|neighborsi|}1≤i≤n), and b the number of bits required to encode any identity idi. The maximal number of bits needed for a message POSITION() is then b(d + 1).
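As a concrete illustration (the numbers are ours, not the book's): on a bidirectional ring of n = 8 processes, we have e = 8 and D = 4, so at most 2ne = 128 POSITION() messages are exchanged and, in the worst case, the execution lasts at most 2D = 8 time units; if identities are encoded on b = 32 bits, then d = 2 and a POSITION() message carries at most b(d + 1) = 96 bits.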
iden-When Initially the Channels Have Only Local Names Let us consider a
pro-cess p i that has c i neighbors to which it is point-to-point connected by c i
chan-nels locally denoted channel i [1 c i ] When each process p i is initially given only
channel i [1 c i ], the processes can easily compute their sets neighbors i To that end,each process executes a preliminary communication phase during which it firstsends a messageID(i) on each channel i [x], 1 ≤ x ≤ c i, and then waits until it has
received the identities of the processes at the other end of its c i channels When
p i has received ID( id k ) on channel channel i [x], it can associate its local address channel i [x] with the identity id kwhose scope is the whole system
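A minimal Python sketch of this preliminary phase (the Channel class and the function names are ours, for illustration only): every process first sends its identity on each of its ports, and only then collects one identity per port.

from queue import Queue

class Channel:
    """A bidirectional channel; each endpoint has its own inbox."""
    def __init__(self):
        self.inbox = (Queue(), Queue())
    def endpoint(self, side):
        # send() writes into the peer's inbox, recv() reads the local one
        return (lambda m: self.inbox[1 - side].put(m),
                lambda: self.inbox[side].get())

def send_id(ident, ports):
    for send, _ in ports:                 # send ID(id_i) on every local port
        send(("ID", ident))

def collect_ids(ports):
    # bind each local port x to the identity id_k received on it
    return {x: recv()[1] for x, (_, recv) in enumerate(ports)}

# Two processes connected by a single channel.
c = Channel()
port1, port2 = c.endpoint(0), c.endpoint(1)
send_id("id1", [port1]); send_id("id2", [port2])
print(collect_ids([port1]))               # {0: 'id2'}
print(collect_ids([port2]))               # {0: 'id1'}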
Port Name When each channel channeli[x] is defined by a local name, the index x is sometimes called a port. Hence, a process pi has ci communication ports.
1.2 Parallel Traversal: Broadcast and Convergecast
It is assumed that, while the identity of a process pi is its index i, no process knows explicitly the value of n (i.e., pn knows that its identity is n, but does not know that its identity is also the number of processes).
1.2.1 Broadcast and Convergecast
Two frequent problems encountered in distributed computing are broadcast and convergecast. These two problems are defined with respect to a distinguished process pa.

• The broadcast problem is a one-to-many communication problem. It consists in designing an algorithm that allows the distinguished process pa to disseminate information to the whole set of processes.
A variant of the broadcast problem is the multicast problem. In that case, the distinguished process pa has to disseminate information to a subset of the processes. This subset can be statically defined or dynamically defined at the time of the multicast invocation.
when GO(data) is received from pk do
(1) if (first reception of GO(data)) then
(2)     for each j ∈ neighborsi \ {k} do send GO(data) to pj end for
(3) end if.

Fig. 1.4 A simple flooding algorithm (code for pi)
• The convergecast problem is a many-to-one communication problem. It consists in designing an algorithm that allows each process pj to send information vj to a distinguished process pa for it to compute some function f(), which is defined on the vector [v1, ..., vn] containing one value per process.
Broadcast and convergecast can be seen as dual communication operations. They are usually used as a pair: pa broadcasts a query to obtain values, one from each process, from which it computes the resulting value f(). As a simple example, pa is a process that queries sensors for temperature values, from which it computes output values (e.g., maximal, minimal, and average temperature values).
1.2.2 A Flooding Algorithm
A simple way to implement a broadcast consists of what is called a flooding algorithm. Such an algorithm is described in Fig. 1.4. To simplify the description, the distinguished process pa initially sends to itself a message denoted GO(data), which carries the information it wants to disseminate. Then, when a process pi receives a copy of this message for the first time, it forwards the message to all its neighbors (except to the sender of the message).

Each message GO(data) can be identified by a sequence number sna. Moreover, the flooding algorithm can be easily adapted to work with any number of distinguished processes by identifying each message broadcast by a distinguished process pa with an identity pair (a, sna).
As the set of processes is assumed to be connected, it is easy to see that the algorithm described in Fig. 1.4 guarantees that the information sent by a distinguished process is eventually received exactly once by each process.
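For illustration, here is a compact Python rendering of this flooding scheme, under our own naming assumptions (the queue plays the role of the messages in transit):

from collections import deque

def flood(neighbors, a, data):
    """neighbors: dict mapping each identity to the set of its neighbors."""
    delivered = set()
    queue = deque([(a, a, data)])          # pa sends GO(data) to itself
    msg_count = 0
    while queue:
        k, i, d = queue.popleft()          # GO(d) sent by pk, received by pi
        msg_count += 1
        if i not in delivered:             # line 1: first reception of GO(data)
            delivered.add(i)
            for j in neighbors[i] - {k}:   # line 2: forward to all but the sender
                queue.append((i, j, d))
    return delivered, msg_count

# Triangle {1, 2, 3} plus a pendant process 4: every process is reached.
print(flood({1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}, a=1, data="query"))

Running it shows that every process is reached while some messages are received and then discarded, which is precisely the inefficiency the tree-based approach of the next section removes.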
1.2.3 Broadcast/Convergecast Based on a Rooted Spanning Tree
The previous flooding algorithm may use up to 2e − |neighborsa| messages (where e is the number of channels), and is consequently not very efficient. A simple way to improve it consists of using an underlying spanning tree rooted at the distinguished process pa.
Fig. 1.5 A rooted spanning tree
Rooted Spanning Tree A spanning tree rooted at pa is a tree which contains the n processes and whose channels (edges) are channels of the communication graph. Each process pi has a single parent, locally denoted parenti, and a (possibly empty) set of children, locally denoted childreni. To simplify the notation, the parent of the root is the root itself, i.e., the distinguished process pa is the only process pi such that parenti = i. Moreover, if j ≠ a, we have j ∈ childreni ⇔ parentj = i, and the channel ⟨i, j⟩ belongs to the communication graph.
An example of a rooted spanning tree is described in Fig. 1.5. The arrows (oriented toward the root) describe the channels of the communication graph that belong to the spanning tree. The dotted edges are the channels of the communication graph that do not belong to the spanning tree. This spanning tree rooted at pa is such that, when considering the position of process pi where neighborsi = {a, k, j, f}, we have parenti = a, childreni = {j, k}, and consequently parentj = parentk = i. Moreover, childreni ∪ {parenti} ⊆ neighborsi = {a, k, j, f}.
Algorithms Given such a rooted spanning tree, the algorithms implementing a broadcast by pa and the associated convergecast to pa are described in Fig. 1.6. As far as the broadcast is concerned, pa first sends the message GO(data) to itself, and then this message is forwarded along the channels of the spanning tree; this restricted flooding stops at the leaves of the tree.
As far as the convergecast is concerned, each leaf pi sends a message BACK(val_seti) to its parent (line 4), where val_seti = {(i, vi)} (line 2), i.e., val_seti contains a single pair carrying the value vi sent by pi to the root.
============= Broadcast =============================
when GO(data) is received from pk do
(1) for each j ∈ childreni \ {k} do send GO(data) to pj end for.

============= Convergecast ===========================
when BACK(val_setj) is received from each pj such that j ∈ childreni do
(2) val_seti ← (∪ j∈childreni val_setj) ∪ {(i, vi)};
(3) let k = parenti;
(4) if (k ≠ i) then send BACK(val_seti) to pk
(5)          else the root pi (= pa) can compute f(val_seti)
(6) end if.

Fig. 1.6 Tree-based broadcast/convergecast (code for pi)
Trang 40single pair carrying the value v i sent by p i to the root A non-leaf process p i waits
for the pairs (k, v k ) from all its children, adds its own pair (i, v i ), and finally sends
the resulting set val_set i to its parent (line4) When the root has received a set ofpairs from each of its children, it has a pair from each process and can compute the
function f () (line5)
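When the whole tree is visible, the convergecast is easy to phrase recursively, as in this small Python sketch (the names are ours; a real execution is of course message-driven, as in Fig. 1.6):

def convergecast(i, children, value):
    """Return val_set_i: the pairs (j, v_j) of the subtree rooted at p_i."""
    val_set = {(i, value[i])}                 # line 2: add the own pair (i, v_i)
    for j in children[i]:                     # one BACK(val_set_j) per child
        val_set |= convergecast(j, children, value)
    return val_set

# Spanning tree rooted at p_1, and one input value per process.
children = {1: {2, 3}, 2: {4}, 3: set(), 4: set()}
value = {1: 17, 2: 5, 3: 42, 4: 8}
val_set = convergecast(1, children, value)
print(sorted(val_set), max(v for _, v in val_set))  # the root computes f(), e.g., max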
1.2.4 Building a Spanning Tree
This section presents a simple algorithm that (a) implements broadcast and convergecast, and (b) builds a spanning tree. This algorithm is sometimes called propagation of information with feedback. Once a spanning tree has been constructed, it can be used for future broadcasts and convergecasts involving the same distinguished process pa.
Local Variables As before, each process pi is provided with a set neighborsi which defines its position in the communication graph and, at the end of the execution, its local variables parenti and childreni will define its position in the spanning tree rooted at pa.

To compute its position in the spanning tree rooted at pa, each process pi uses an auxiliary integer local variable denoted expected_msgi. This variable contains the number of messages that pi is waiting for from its children before sending a message BACK() to its parent.
Algorithm The broadcast/convergecast algorithm building a spanning tree is described in Fig. 1.7. To simplify the presentation, it is first assumed that the channels are FIFO (first in, first out). The distinguished process pa is the only process which receives the external message START() (line 1). Upon its reception, pa initializes parenta, childrena, and expected_msga and sends a message GO(data) to each of its neighbors (line 2).
When a process pi receives a message GO(data) for the first time, it defines the sender pj as its parent in the spanning tree, and initializes childreni to ∅ and expected_msgi to the number of its neighbors other than pj (line 4). If its parent is its only neighbor, it sends back the pair (i, vi), thereby indicating to pj that it is one of its children (lines 5–6). Otherwise, pi forwards the message GO(data) to all its neighbors but its parent pj (line 7).
If parenti ≠ ⊥ when pi receives GO(data), it has already determined its parent in the spanning tree and forwarded the message GO(data). It consequently sends by return to pj the message BACK(∅), where ∅ is used to indicate to pj that pi is not one of its children (line 9).
When a process pi receives a message BACK(res, val_set) from a neighbor pj, it decreases expected_msgi (line 11) and adds pj to childreni if val_set ≠ ∅ (line 12). Then, if pi has received a message BACK() from all its neighbors (but its parent, line 13), it sends to its parent (lines 15–16) the set val_set containing its own pair (i, vi) together with the pairs it has received from its children.
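Under the same assumptions, here is a compact Python simulation of this propagation of information with feedback (all names are ours and the BACK() message is simplified to carry only val_set; Fig. 1.7 remains the authoritative pseudocode). Each process records its parent, its children, and the number of BACK() messages it still expects:

from collections import deque

def build_spanning_tree(neighbors, a, value):
    parent = {i: None for i in neighbors}          # parent_i, initially ⊥
    children = {i: set() for i in neighbors}
    expected = {}                                  # expected_msg_i
    queue = deque([("GO", a, a)])                  # models the external START()
    result = {}

    def send_back(i):                              # BACK() travels toward the root
        val_set = {(i, value[i])}
        for c in children[i]:
            val_set |= result[c]
        result[i] = val_set
        if parent[i] != i:                         # the root keeps the final set
            queue.append(("BACK", i, parent[i], val_set))

    while queue:
        msg = queue.popleft()
        if msg[0] == "GO":
            _, j, i = msg
            if parent[i] is None:                  # first GO(): p_j becomes parent
                parent[i] = j
                expected[i] = len(neighbors[i] - {j})
                if expected[i] == 0:
                    send_back(i)                   # a leaf answers immediately
                else:
                    for y in neighbors[i] - {j}:
                        queue.append(("GO", i, y))
            else:                                  # parent already known
                queue.append(("BACK", i, j, set()))  # ∅: "I am not your child"
        else:
            _, j, i, val_set = msg
            expected[i] -= 1
            if val_set:
                children[i].add(j)
            if expected[i] == 0:
                send_back(i)
    return parent, children, result[a]             # the root holds all the pairs

neighbors = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
value = {1: 17, 2: 5, 3: 42, 4: 8}
print(build_spanning_tree(neighbors, a=1, value=value))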
to arrive at p This message wakes up p