Michel Raynal

Distributed Algorithms for Message-Passing Systems
Institut Universitaire de France
Springer Heidelberg New York Dordrecht London
Library of Congress Control Number: 2013942973
ACM Computing Classification (1998): F.1, D.1, B.3
© Springer-Verlag Berlin Heidelberg 2013
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
The profusion of things hid the scarcity of ideas and the wearing out of beliefs.
[…] To retain something of the time in which we shall no longer be.
In Les années (2008), Annie Ernaux
Midway upon the journey of our life
I found myself within a forest dark,
For the straightforward pathway had been lost.
In La divina commedia (1307–1321), Dante Alighieri (1265–1321)
We must not want to be anything, but to become everything.
Johann Wolfgang von Goethe (1749–1832)
Each generation doubtless feels called upon to reform the world.
Mine knows, however, that it will not reform it. But its task is perhaps even greater.
It consists in preventing the world from destroying itself.
Speech at the Nobel Banquet, Stockholm, December 10, 1957, Albert Camus (1913–1960)
Nothing is as precarious as living,
Nothing as being is so fleeting,
It is a little like melting for the frost,
Or for the wind being light.
I arrive where I am a stranger.
In Le voyage de Hollande (1965), Louis Aragon (1897–1982)
What Is Distributed Computing? Distributed computing was born in the late 1970s when researchers and practitioners started taking into account the intrinsic characteristics of physically distributed systems. The field then emerged as a specialized research area distinct from networking, operating systems, and parallel computing.
Distributed computing arises when one has to solve a problem in terms of distributed entities (usually called processors, nodes, processes, actors, agents, sensors, peers, etc.) such that each entity has only a partial knowledge of the many parameters involved in the problem that has to be solved. While parallel computing and real-time computing can be characterized, respectively, by the terms efficiency and on-time computing, distributed computing can be characterized by the term uncertainty. This uncertainty is created by asynchrony, multiplicity of control flows, absence of shared memory and global time, failure, dynamicity, mobility, etc. Mastering one form or another of uncertainty is pervasive in all distributed computing problems. A main difficulty in designing distributed algorithms comes from the fact that each entity cooperating in the achievement of a common goal cannot have instantaneous knowledge of the current state of the other entities; it can only know their past local states.
Although distributed algorithms are often made up of a few lines, their behavior can be difficult to understand and their properties hard to state and prove. Hence, distributed computing is not only a fundamental topic but also a challenging topic where simplicity, elegance, and beauty are first-class citizens.
Why This Book? While there are a lot of books on sequential computing (both on basic data structures and on algorithms), this is not the case in distributed computing. Most books on distributed computing consider advanced topics where the uncertainty inherent to distributed computing is created by the net effect of asynchrony and failures. It follows that these books are more appropriate for graduate students than for undergraduate students.
The aim of this book is to present in a comprehensive way basic notions, concepts, and algorithms of distributed computing when the distributed entities cooperate by sending and receiving messages on top of an underlying network. In this case, the main difficulty comes from the physical distribution of the entities and the asynchrony of the environment in which they evolve.
Audience This book has been written primarily for people who are not familiar with the topic and the concepts that are presented. These include mainly:
• Senior-level undergraduate students and graduate students in computer science or computer engineering, who are interested in the principles and foundations of distributed computing.
• Practitioners and engineers who want to be aware of the state-of-the-art concepts, basic principles, mechanisms, and techniques encountered in distributed computing.
Prerequisites for this book include undergraduate courses on algorithms and basic knowledge of operating systems. Selections of chapters for undergraduate and graduate courses are suggested in the section titled "How to Use This Book" in the Afterword.
Content As already indicated, this book covers algorithms, basic principles, and foundations of message-passing programming, i.e., programs where the entities communicate by sending and receiving messages through a network. The world is distributed, and the algorithmic thinking suited to distributed applications and systems is not reducible to sequential computing. Knowledge of the bases of distributed computing is becoming more important than ever as more and more computer applications are now distributed. The book is composed of six parts.
• The aim of the first part, which is made up of five chapters, is to give a feel for the nature of distributed algorithms, i.e., what makes them different from sequential or parallel algorithms. To that end, it mainly considers distributed graph algorithms. In this context, each node of the graph is a process, which has to compute a result whose meaning depends on the whole graph.
Basic distributed algorithms such as network traversals, shortest-path algorithms, vertex coloring, knot detection, etc., are first presented. Then, a general framework for distributed graph algorithms is introduced. A chapter is devoted to leader election algorithms on a ring network, and another chapter focuses on the navigation of a network by mobile objects.
• The second part is on the nature of distributed executions. It is made up of four chapters. In some sense, this part is the core of the book. It explains what a distributed execution is, the fundamental notion of a consistent global state, and the impossibility (without freezing the computation) of knowing whether a computed consistent global state has been passed through by the execution or not. Then, this part of the book addresses an important issue of distributed computations, namely the notion of logical time: scalar (linear) time, vector time, and matrix time. Each type of time is analyzed and examples of their uses are given.
A chapter, which extends the notion of a global state, is then devoted to asynchronous distributed checkpointing. Finally, the last chapter of this part shows how to simulate a synchronous system on top of an asynchronous system (such simulators are called synchronizers).
• The third part of the book is made up of two chapters devoted to distributed mutual exclusion and distributed resource allocation. Different families of permission-based mutual exclusion algorithms are presented. The notion of an adaptive algorithm is also introduced. The notion of a critical section with multiple entries, and the case of resources with a single or several instances, are also presented. Associated deadlock prevention techniques are introduced.
• The fourth part of the book is on the definition and the implementation of communication operations whose abstraction level is higher than the simple send/receive of messages. These communication abstractions impose order constraints on message deliveries. Causal message delivery and total order broadcast are first presented in one chapter. Then, another chapter considers synchronous communication (also called rendezvous or logically instantaneous communication).
• The fifth part of the book, which is made up of two chapters, is on the detection of stable properties encountered in distributed computing. A stable property is a property that, once true, remains true forever. The properties which are studied are the detection of the termination of a distributed computation, and the detection of distributed deadlock. This part of the book is strongly related to the second part (which is devoted to the notion of a global state).
• The sixth and last part of the book, which is also made up of two chapters, is devoted to the notion of a distributed shared memory. The aim is here to provide the entities (processes) with a set of objects that allow them to cooperate at an abstraction level more appropriate than the use of messages. Two consistency conditions, which can be associated with these objects, are presented and investigated, namely, atomicity (also called linearizability) and sequential consistency. Several algorithms implementing these consistency conditions are described.
To have a more complete feeling of the spirit of this book, the reader is invited to consult the section "The Aim of This Book" in the Afterword, which describes what it is hoped has been learned from this book. Each chapter starts with a short presentation and a list of the main keywords, and terminates with a summary of its content. Each of the six parts of the book is also introduced by a brief description of its aim and its technical content.
Acknowledgments This book originates from lecture notes for undergraduate and graduate courses on distributed computing that I give at the University of Rennes (France) and, as an invited professor, at several universities all over the world. I would like to thank the students for their questions that, in one way or another, have contributed to this book. I also want to thank Ronan Nugent (Springer) for his support and his help in putting it all together. Last but not least (and maybe most importantly), I also want to thank all the researchers whose results are presented in this book. Without their work, this book would not exist.
Michel Raynal
Professeur des Universités
Institut Universitaire de France
IRISA-ISTIC, Université de Rennes 1
Campus de Beaulieu, 35042 Rennes, France
March–October 2012
Rennes, Saint-Grégoire, Tokyo, Fukuoka (AINA'12), Arequipa (LATIN'12), Reykjavik (SIROCCO'12), Palermo (CISIS'12), Madeira (PODC'12), Lisbon, Douelle, Saint-Philibert, Rhodes Island (Europar'12), Salvador de Bahia (DISC'12), Mexico City (Turing Year at UNAM)
Contents

Part I Distributed Graph Algorithms
1 Basic Definitions and Network Traversal Algorithms 3
1.1 Distributed Algorithms 3
1.1.1 Definition 3
1.1.2 An Introductory Example: Learning the Communication Graph 6
1.2 Parallel Traversal: Broadcast and Convergecast 9
1.2.1 Broadcast and Convergecast 9
1.2.2 A Flooding Algorithm 10
1.2.3 Broadcast/Convergecast Based on a Rooted Spanning Tree 10
1.2.4 Building a Spanning Tree 12
1.3 Breadth-First Spanning Tree 16
1.3.1 Breadth-First Spanning Tree Built Without Centralized Control 17
1.3.2 Breadth-First Spanning Tree Built with Centralized Control 20
1.4 Depth-First Traversal 24
1.4.1 A Simple Algorithm 24
1.4.2 Application: Construction of a Logical Ring 27
1.5 Summary 32
1.6 Bibliographic Notes 32
1.7 Exercises and Problems 33
2 Distributed Graph Algorithms 35
2.1 Distributed Shortest Path Algorithms 35
2.1.1 A Distributed Adaptation of Bellman–Ford’s Shortest Path Algorithm 35
2.1.2 A Distributed Adaptation of Floyd–Warshall’s Shortest Paths Algorithm 38
2.2 Vertex Coloring and Maximal Independent Set 42
2.2.1 On Sequential Vertex Coloring 42
2.2.2 Distributed (Δ + 1)-Coloring of Processes 43
2.2.3 Computing a Maximal Independent Set 46
2.3 Knot and Cycle Detection 50
2.3.1 Directed Graph, Knot, and Cycle 50
2.3.2 Communication Graph, Logical Directed Graph, and Reachability 51
2.3.3 Specification of the Knot Detection Problem 51
2.3.4 Principle of the Knot/Cycle Detection Algorithm 52
2.3.5 Local Variables 53
2.3.6 Behavior of a Process 54
2.4 Summary 57
2.5 Bibliographic Notes 58
2.6 Exercises and Problems 58
3 An Algorithmic Framework to Compute Global Functions on a Process Graph 59
3.1 Distributed Computation of Global Functions 59
3.1.1 Type of Global Functions 59
3.1.2 Constraints on the Computation 60
3.2 An Algorithmic Framework 61
3.2.1 A Round-Based Framework 61
3.2.2 When the Diameter Is Not Known 64
3.3 Distributed Determination of Cut Vertices 66
3.3.1 Cut Vertices 66
3.3.2 An Algorithm Determining Cut Vertices 67
3.4 Improving the Framework 69
3.4.1 Two Types of Filtering 69
3.4.2 An Improved Algorithm 70
3.5 The Case of Regular Communication Graphs 72
3.5.1 Tradeoff Between Graph Topology and Number of Rounds 72
3.5.2 De Bruijn Graphs 73
3.6 Summary 75
3.7 Bibliographic Notes 76
3.8 Problem 76
4 Leader Election Algorithms 77
4.1 The Leader Election Problem 77
4.1.1 Problem Definition 77
4.1.2 Anonymous Systems: An Impossibility Result 78
4.1.3 Basic Assumptions and Principles of the Election Algorithms 79
4.2 A Simple O(n²) Leader Election Algorithm for Unidirectional Rings 79
4.2.1 Context and Principle 79
4.2.2 The Algorithm 80
4.2.3 Time Cost of the Algorithm 80
4.2.4 Message Cost of the Algorithm 81
4.2.5 A Simple Variant 82
4.3 An O(n log n) Leader Election Algorithm for Bidirectional Rings 83
4.3.1 Context and Principle 83
4.3.2 The Algorithm 84
4.3.3 Time and Message Complexities 85
4.4 An O(n log n) Election Algorithm for Unidirectional Rings 86
4.4.1 Context and Principles 86
4.4.2 The Algorithm 88
4.4.3 Discussion: Message Complexity and FIFO Channels 89
4.5 Two Particular Cases 89
4.6 Summary 90
4.7 Bibliographic Notes 90
4.8 Exercises and Problems 91
5 Mobile Objects Navigating a Network 93
5.1 Mobile Object in a Process Graph 93
5.1.1 Problem Definition 93
5.1.2 Mobile Object Versus Mutual Exclusion 94
5.1.3 A Centralized (Home-Based) Algorithm 94
5.1.4 The Algorithms Presented in This Chapter 95
5.2 A Navigation Algorithm for a Complete Network 96
5.2.1 Underlying Principles 96
5.2.2 The Algorithm 97
5.3 A Navigation Algorithm Based on a Spanning Tree 100
5.3.1 Principles of the Algorithm: Tree Invariant and Proxy Behavior 101
5.3.2 The Algorithm 102
5.3.3 Discussion and Properties 104
5.3.4 Proof of the Algorithm 106
5.4 An Adaptive Navigation Algorithm 108
5.4.1 The Adaptivity Property 109
5.4.2 Principle of the Implementation 109
5.4.3 An Adaptive Algorithm Based on a Distributed Queue 111
5.4.4 Properties 113
5.4.5 Example of an Execution 114
5.5 Summary 115
5.6 Bibliographic Notes 115
5.7 Exercises and Problems 116
Part II Logical Time and Global States in Distributed Systems

6 Nature of Distributed Computations and the Concept of a Global State 121
6.1 A Distributed Execution Is a Partial Order on Local Events 122
6.1.1 Basic Definitions 122
6.1.2 A Distributed Execution Is a Partial Order on Local Events 122
6.1.3 Causal Past, Causal Future, Concurrency, Cut 123
6.1.4 Asynchronous Distributed Execution with Respect to Physical Time 125
6.2 A Distributed Execution Is a Partial Order on Local States 127
6.3 Global State and Lattice of Global States 129
6.3.1 The Concept of a Global State 129
6.3.2 Lattice of Global States 129
6.3.3 Sequential Observations 131
6.4 Global States Including Process States and Channel States 132
6.4.1 Global State Including Channel States 132
6.4.2 Consistent Global State Including Channel States 133
6.4.3 Consistent Global State Versus Consistent Cut 134
6.5 On-the-Fly Computation of Global States 135
6.5.1 Global State Computation Is an Observation Problem 135
6.5.2 Problem Definition 136
6.5.3 On the Meaning of the Computed Global State 136
6.5.4 Principles of Algorithms Computing a Global State 137
6.6 A Global State Algorithm Suited to FIFO Channels 138
6.6.1 Principle of the Algorithm 138
6.6.2 The Algorithm 140
6.6.3 Example of an Execution 141
6.7 A Global State Algorithm Suited to Non-FIFO Channels 143
6.7.1 The Algorithm and Its Principles 144
6.7.2 How to Compute the State of the Channels 144
6.8 Summary 146
6.9 Bibliographic Notes 146
6.10 Exercises and Problems 147
7 Logical Time in Asynchronous Distributed Systems 149
7.1 Linear Time 149
7.1.1 Scalar (or Linear) Time 150
7.1.2 From Partial Order to Total Order: The Notion of a Timestamp 151
7.1.3 Relating Logical Time and Timestamps with Observations 152
7.1.4 Timestamps in Action: Total Order Broadcast 153
7.2 Vector Time 159
7.2.1 Vector Time and Vector Clocks 159
7.2.2 Vector Clock Properties 162
7.2.3 On the Development of Vector Time 163
7.2.4 Relating Vector Time and Global States 165
7.2.5 Vector Clocks in Action: On-the-Fly Determination of a Global State Property 166
7.2.6 Vector Clocks in Action: On-the-Fly Determination of the Immediate Predecessors 170
7.3 On the Size of Vector Clocks 173
7.3.1 A Lower Bound on the Size of Vector Clocks 174
7.3.2 An Efficient Implementation of Vector Clocks 176
7.3.3 k-Restricted Vector Clock 181
7.4 Matrix Time 182
7.4.1 Matrix Clock: Definition and Algorithm 182
7.4.2 A Variant of Matrix Time in Action: Discard Old Data 184
7.5 Summary 186
7.6 Bibliographic Notes 186
7.7 Exercises and Problems 187
8 Asynchronous Distributed Checkpointing 189
8.1 Definitions and Main Theorem 189
8.1.1 Local and Global Checkpoints 189
8.1.2 Z-Dependency, Zigzag Paths, and Z-Cycles 190
8.1.3 The Main Theorem 192
8.2 Consistent Checkpointing Abstractions 196
8.2.1 Z-Cycle-Freedom 196
8.2.2 Rollback-Dependency Trackability 197
8.2.3 On Distributed Checkpointing Algorithms 198
8.3 Checkpointing Algorithms Ensuring Z-Cycle Prevention 199
8.3.1 An Operational Characterization of Z-Cycle-Freedom 199
8.3.2 A Property of a Particular Dating System 199
8.3.3 Two Simple Algorithms Ensuring Z-Cycle Prevention 201
8.3.4 On the Notion of an Optimal Algorithm for Z-Cycle Prevention 203
8.4 Checkpointing Algorithms Ensuring Rollback-Dependency Trackability 203
8.4.1 Rollback-Dependency Trackability (RDT) 203
8.4.2 A Simple Brute Force RDT Checkpointing Algorithm 205
8.4.3 The Fixed Dependency After Send (FDAS) RDT Checkpointing Algorithm 206
8.4.4 Still Reducing the Number of Forced Local Checkpoints 207
8.5 Message Logging for Uncoordinated Checkpointing 211
8.5.1 Uncoordinated Checkpointing 211
8.5.2 To Log or Not to Log Messages on Stable Storage 211
8.5.3 A Recovery Algorithm 214
8.5.4 A Few Improvements 215
8.6 Summary 216
8.7 Bibliographic Notes 216
8.8 Exercises and Problems 217
9 Simulating Synchrony on Top of Asynchronous Systems 219
9.1 Synchronous Systems, Asynchronous Systems, and Synchronizers 219
9.1.1 Synchronous Systems 219
9.1.2 Asynchronous Systems and Synchronizers 221
9.1.3 On the Efficiency Side 222
9.2 Basic Principle for a Synchronizer 223
9.2.1 The Main Problem to Solve 223
9.2.2 Principle of the Solutions 224
9.3 Basic Synchronizers: α and β 224
9.3.1 Synchronizer α 224
9.3.2 Synchronizer β 227
9.4 Advanced Synchronizers: γ and δ 230
9.4.1 Synchronizer γ 230
9.4.2 Synchronizer δ 234
9.5 The Case of Networks with Bounded Delays 236
9.5.1 Context and Hypotheses 236
9.5.2 The Problem to Solve 237
9.5.3 Synchronizer λ 238
9.5.4 Synchronizer μ 239
9.5.5 When the Local Physical Clocks Drift 240
9.6 Summary 242
9.7 Bibliographic Notes 243
9.8 Exercises and Problems 244
Part III Mutual Exclusion and Resource Allocation

10 Permission-Based Mutual Exclusion Algorithms 247
10.1 The Mutual Exclusion Problem 247
10.1.1 Definition 247
10.1.2 Classes of Distributed Mutex Algorithms 248
10.2 A Simple Algorithm Based on Individual Permissions 249
10.2.1 Principle of the Algorithm 249
10.2.2 The Algorithm 251
10.2.3 Proof of the Algorithm 252
10.2.4 From Simple Mutex to Mutex on Classes of Operations 255
10.3 Adaptive Mutex Algorithms Based on Individual Permissions 256
10.3.1 The Notion of an Adaptive Algorithm 256
10.3.2 A Timestamp-Based Adaptive Algorithm 257
10.3.3 A Bounded Adaptive Algorithm 259
10.3.4 Proof of the Bounded Adaptive Mutex Algorithm 262
10.4 An Algorithm Based on Arbiter Permissions 264
10.4.1 Permissions Managed by Arbiters 264
10.4.2 Permissions Versus Quorums 265
10.4.3 Quorum Construction 266
10.4.4 An Adaptive Mutex Algorithm Based on Arbiter Permissions 268
10.5 Summary 273
10.6 Bibliographic Notes 273
10.7 Exercises and Problems 274
11 Distributed Resource Allocation 277
11.1 A Single Resource with Several Instances 277
11.1.1 The k-out-of-M Problem 277
11.1.2 Mutual Exclusion with Multiple Entries: The 1-out-of-M Mutex Problem 278
11.1.3 An Algorithm for the k-out-of-M Mutex Problem 280
11.1.4 Proof of the Algorithm 283
11.1.5 From Mutex Algorithms to k-out-of-M Algorithms 285
11.2 Several Resources with a Single Instance 285
11.2.1 Several Resources with a Single Instance 286
11.2.2 Incremental Requests for Single Instance Resources: Using a Total Order 287
11.2.3 Incremental Requests for Single Instance Resources: Reducing Process Waiting Chains 290
11.2.4 Simultaneous Requests for Single Instance Resources and Static Sessions 292
11.2.5 Simultaneous Requests for Single Instance Resources and Dynamic Sessions 293
11.3 Several Resources with Multiple Instances 295
11.4 Summary 297
11.5 Bibliographic Notes 298
11.6 Exercises and Problems 299
Part IV High-Level Communication Abstractions

12 Order Constraints on Message Delivery 303
12.1 The Causal Message Delivery Abstraction 303
12.1.1 Definition of Causal Message Delivery 304
12.1.2 A Causality-Based Characterization of Causal Message Delivery 305
12.1.3 Causal Order with Respect to Other Message Ordering Constraints 306
12.2 A Basic Algorithm for Point-to-Point Causal Message Delivery 306
12.2.1 A Simple Algorithm 306
12.2.2 Proof of the Algorithm 309
12.2.3 Reduce the Size of Control Information Carried by Messages 310
12.3 Causal Broadcast 313
12.3.1 Definition and a Simple Algorithm 313
12.3.2 The Notion of a Causal Barrier 315
12.3.3 Causal Broadcast with Bounded Lifetime Messages 317
12.4 The Total Order Broadcast Abstraction 320
12.4.1 Strong Total Order Versus Weak Total Order 320
12.4.2 An Algorithm Based on a Coordinator Process or a Circulating Token 322
12.4.3 An Inquiry-Based Algorithm 324
12.4.4 An Algorithm for Synchronous Systems 326
12.5 Playing with a Single Channel 328
12.5.1 Four Order Properties on a Channel 328
12.5.2 A General Algorithm Implementing These Properties 329
12.6 Summary 332
12.7 Bibliographic Notes 332
12.8 Exercises and Problems 333
13 Rendezvous (Synchronous) Communication 335
13.1 The Synchronous Communication Abstraction 335
13.1.1 Definition 335
13.1.2 An Example of Use 337
13.1.3 A Message Pattern-Based Characterization 338
13.1.4 Types of Algorithms Implementing Synchronous Communications 341
13.2 Algorithms for Nondeterministic Planned Interactions 341
13.2.1 Deterministic and Nondeterministic Communication Contexts 341
13.2.2 An Asymmetric (Static) Client–Server Implementation 342
13.2.3 An Asymmetric Token-Based Implementation 345
13.3 An Algorithm for Nondeterministic Forced Interactions 350
13.3.1 Nondeterministic Forced Interactions 350
13.3.2 A Simple Algorithm 350
13.3.3 Proof of the Algorithm 352
13.4 Rendezvous with Deadlines in Synchronous Systems 354
13.4.1 Synchronous Systems and Rendezvous with Deadline 354
13.4.2 Rendezvous with Deadline Between Two Processes 355
13.4.3 Introducing Nondeterministic Choice 358
13.4.4 n-Way Rendezvous with Deadline 360
13.5 Summary 361
13.6 Bibliographic Notes 361
13.7 Exercises and Problems 362
Part V Detection of Properties on Distributed Executions

14 Distributed Termination Detection 367
14.1 The Distributed Termination Detection Problem 367
14.1.1 Process and Channel States 367
14.1.2 Termination Predicate 368
14.1.3 The Termination Detection Problem 369
14.1.4 Types and Structure of Termination Detection Algorithms 369
14.2 Termination Detection in the Asynchronous Atomic Model 370
14.2.1 The Atomic Model 370
14.2.2 The Four-Counter Algorithm 371
14.2.3 The Counting Vector Algorithm 373
14.2.4 The Four-Counter Algorithm vs the Counting Vector Algorithm 376
14.3 Termination Detection in Diffusing Computations 376
14.3.1 The Notion of a Diffusing Computation 376
14.3.2 A Detection Algorithm Suited to Diffusing Computations 377
14.4 A General Termination Detection Algorithm 378
14.4.1 Wave and Sequence of Waves 379
14.4.2 A Reasoned Construction 381
14.5 Termination Detection in a Very General Distributed Model 385
14.5.1 Model and Nondeterministic Atomic Receive Statement 385
14.5.2 The Predicate fulfilled() 387
14.5.3 Static vs Dynamic Termination: Definition 388
14.5.4 Detection of Static Termination 390
14.5.5 Detection of Dynamic Termination 393
14.6 Summary 396
14.7 Bibliographic Notes 396
14.8 Exercises and Problems 397
15 Distributed Deadlock Detection 401
15.1 The Deadlock Detection Problem 401
15.1.1 Wait-For Graph (WFG) 401
15.1.2 AND and OR Models Associated with Deadlock 403
15.1.3 Deadlock in the AND Model 403
15.1.4 Deadlock in the OR Model 404
15.1.5 The Deadlock Detection Problem 404
15.1.6 Structure of Deadlock Detection Algorithms 405
15.2 Deadlock Detection in the One-at-a-Time Model 405
15.2.1 Principle and Local Variables 406
15.2.2 A Detection Algorithm 406
15.2.3 Proof of the Algorithm 407
15.3 Deadlock Detection in the AND Communication Model 408
15.3.1 Model and Principle of the Algorithm 409
15.3.2 A Detection Algorithm 409
15.3.3 Proof of the Algorithm 411
15.4 Deadlock Detection in the OR Communication Model 413
15.4.1 Principle 413
15.4.2 A Detection Algorithm 416
15.4.3 Proof of the Algorithm 419
15.5 Summary 421
15.6 Bibliographic Notes 421
15.7 Exercises and Problems 422
Part VI Distributed Shared Memory
16 Atomic Consistency (Linearizability) 427
16.1 The Concept of a Distributed Shared Memory 427
16.2 The Atomicity Consistency Condition 429
16.2.1 What Is the Issue? 429
16.2.2 An Execution Is a Partial Order on Operations 429
16.2.3 Atomicity: Formal Definition 430
16.3 Atomic Objects Compose for Free 432
16.4 Message-Passing Implementations of Atomicity 435
16.4.1 Atomicity Based on a Total Order Broadcast Abstraction 435
16.4.2 Atomicity of Read/Write Objects Based on Server Processes 437
16.4.3 Atomicity Based on a Server Process and Copy Invalidation 438
16.4.4 Introducing the Notion of an Owner Process 439
16.4.5 Atomicity Based on a Server Process and Copy Update 443
16.5 Summary 444
16.6 Bibliographic Notes 444
16.7 Exercises and Problems 445
17 Sequential Consistency 447
17.1 Sequential Consistency 447
17.1.1 Definition 447
17.1.2 Sequential Consistency Is Not a Local Property 449
17.1.3 Partial Order for Sequential Consistency 450
17.1.4 Two Theorems for Sequentially Consistent Read/Write Registers 451
17.1.5 From Theorems to Algorithms 453
17.2 Sequential Consistency from Total Order Broadcast 453
17.2.1 A Fast Read Algorithm for Read/Write Objects 453
17.2.2 A Fast Write Algorithm for Read/Write Objects 455
17.2.3 A Fast Enqueue Algorithm for Queue Objects 456
17.3 Sequential Consistency from a Single Server 456
17.3.1 The Single Server Is a Process 456
17.3.2 The Single Server Is a Navigating Token 459
17.4 Sequential Consistency with a Server per Object 460
17.4.1 Structural View 460
17.4.2 The Object Managers Must Cooperate 461
17.4.3 An Algorithm Based on the OO Constraint 462
17.5 A Weaker Consistency Condition: Causal Consistency 464
17.5.1 Definition 464
17.5.2 A Simple Algorithm 466
17.5.3 The Case of a Single Object 467
17.6 A Hierarchy of Consistency Conditions 468
17.7 Summary 468
17.8 Bibliographic Notes 469
17.9 Exercises and Problems 470
Afterword 471
The Aim of This Book 471
Most Important Concepts, Notions, and Mechanisms Presented in This Book 471
How to Use This Book 473
From Failure-Free Systems to Failure-Prone Systems 474
A Series of Books 474
References 477
Index 495
Notation

no-op                    no operation
⟨a, b⟩                   pair with two elements a and b
m_1; …; m_q              sequence of messages
a_i[1..s]                array of size s (local to process p_i)
for each i ∈ {1, …, m}   order irrelevant
for each i from 1 to m   order relevant
return(v)                returns v and terminates the operation invocation
¬(a R b)                 relation R does not include the pair ⟨a, b⟩
List of Figures

Fig. 1.1 Three graph types of particular interest 4
Fig. 1.2 Synchronous execution (left) vs. asynchronous (right) execution 5
Fig. 1.3 Learning the communication graph (code for p_i) 7
Fig. 1.4 A simple flooding algorithm (code for p_i) 10
Fig. 1.5 A rooted spanning tree 11
Fig. 1.6 Tree-based broadcast/convergecast (code for p_i) 11
Fig. 1.7 Construction of a rooted spanning tree (code for p_i) 13
Fig. 1.8 Left: underlying communication graph; right: spanning tree 14
Fig. 1.9 An execution of the algorithm constructing a spanning tree 14
Fig. 1.10 Two different spanning trees built from the same communication graph 16
Fig. 1.11 Construction of a breadth-first spanning tree without centralized control (code for p_i) 18
Fig. 1.12 An execution of the algorithm of Fig. 1.11 19
Fig. 1.13 Successive waves launched by the root process p_a 21
Fig. 1.14 Construction of a breadth-first spanning tree with centralized control (starting code) 22
Fig. 1.15 Construction of a breadth-first spanning tree with centralized control (code for a process p_i) 22
Fig. 1.16 Depth-first traversal of a communication graph (code for p_i) 25
Fig. 1.17 Time and message optimal depth-first traversal (code for p_i) 27
Fig. 1.18 Management of the token at process p_i 29
Fig. 1.19 From a depth-first traversal to a ring (code for p_i) 29
Fig. 1.20 Sense of direction of the ring and computation of routing tables 30
Fig. 1.21 An example of a logical ring construction 31
Fig. 1.22 An anonymous network 34
Fig. 2.1 Bellman–Ford's dynamic programming principle 36
Fig. 2.2 A distributed adaptation of Bellman–Ford's shortest path algorithm (code for p_i) 37
Fig. 2.3 A distributed synchronous shortest path algorithm (code for p_i) 38
Fig. 2.4 Floyd–Warshall's sequential shortest path algorithm 39
Fig. 2.5 The principle that underlies Floyd–Warshall's shortest paths algorithm 39
Fig. 2.7 Sequential (Δ + 1)-coloring of the vertices of a graph 42
Fig. 2.8 Distributed (Δ + 1)-coloring from an initial m-coloring where n ≥ m ≥ Δ + 2 43
Fig. 2.10 Examples of maximal independent sets 46
Fig. 2.11 From m-coloring to a maximal independent set (code for p_i) 47
… independent set (code for p_i) 48
Fig. 2.14 A directed graph with a knot 51
Fig. 2.15 Possible message pattern during a knot detection 53
Fig. 2.16 Asynchronous knot detection (code of p_i) 55
Fig. 2.17 Knot/cycle detection: example 57
Fig. 3.1 … (code for p_i) 63
Fig. 3.2 A diameter-independent generic algorithm (code for p_i) 65
Fig. 3.3 A process graph with three cut vertices 66
Fig. 3.4 Determining cut vertices: principle 67
Fig. 3.5 An algorithm determining the cut vertices (code for p_i) 68
Fig. 3.6 A general algorithm with filtering (code for p_i) 71
… (code for p_i) 75
Fig. 4.1 Chang and Roberts' election algorithm (code for p_i) 80
Fig. 4.2 Worst identity distribution for message complexity 81
Fig. 4.3 A variant of Chang and Roberts' election algorithm (code for p_i) 83
Fig. 4.5 Competitors at the end of round r are at distance greater than 2^r 84
Fig. 4.6 Hirschberg and Sinclair's election algorithm (code for p_i) 85
Fig. 4.7 Neighbor processes on the unidirectional ring 87
Fig. 4.8 From the first to the second round 87
Fig. 4.9 Dolev, Klawe, and Rodeh's election algorithm (code for p_i) 88
Fig. 4.10 Index-based randomized election (code for p_i) 90
Fig. 5.2 Structural view of the navigation algorithm (module at process p_i) 98
Fig. 5.3 A navigation algorithm for a complete network (code for p_i) 99
Fig. 5.5 Navigation tree: initial state 101
Fig. 5.6 Navigation tree: after the object has moved to p_c 102
Fig. 5.7 Navigation tree: proxy role of a process 102
Fig. 5.8 A spanning tree-based navigation algorithm (code for p_i) 104
Fig. 5.9 The case of non-FIFO channels 105
Fig. 5.10 … R = [d(i_1), d(i_2), …, d(i_{x−1}), d(i_x), 0, …, 0] 108
Fig. 5.11 A dynamically evolving spanning tree 110
Fig. 5.12 A navigation algorithm based on a distributed queue (code for p_i) 112
Fig. 5.13 From the worst to the best case 113
Fig. 5.14 Example of an execution 114
Fig. 5.15 A hybrid navigation algorithm (code for p_i) 117
Fig. 6.1 A distributed execution as a partial order 124
Fig. 6.2 Past, future, and concurrency sets associated with an event 125
Fig. 6.3 Cut and consistent cut 126
Fig. 6.4 Two instances of the same execution 126
Fig. 6.5 Consecutive local states of a process p_i 127
Fig. 6.6 From a relation on events to a relation on local states 128
Fig. 6.7 A two-process distributed execution 130
Fig. 6.8 Lattice of consistent global states 130
Fig. 6.9 Sequential observations of a distributed computation 131
Fig. 6.10 Illustrating the notations "e ∈ σ_i" and "f ∈ σ_i" 133
Fig. 6.11 In-transit and orphan messages 133
Fig. 6.12 Cut versus global state 135
Fig. 6.13 Global state computation: structural view 136
Fig. 6.14 Recording of a local state 139
Fig. 6.15 Reception of a MARKER() message: case 1 139
Fig. 6.16 Reception of a MARKER() message: case 2 139
Fig. 6.17 Global state computation (FIFO channels, code for cp_i) 140
Fig. 6.18 A simple automaton for process p_i (i = 1, 2) 141
Fig. 6.19 Prefix of a simple execution 142
Fig. 6.20 … on a distributed execution 142
Fig. 6.21 Consistent cut associated with the computed global state 143
Fig. 6.22 A rubber band transformation 143
Fig. 6.23 Global state computation (non-FIFO channels, code for cp_i) 145
Fig. 6.24 Example of a global state computation (non-FIFO channels) 145
… (non-FIFO channels, code for cp_i) 148
Fig. 7.1 Implementation of a linear clock (code for process p_i) 150
Fig. 7.2 A simple example of a linear clock system 151
Fig. 7.3 A non-sequential observation obtained from linear time 152
Fig. 7.5 Total order broadcast: the problem that has to be solved 155
Fig. 7.6 Structure of the total order broadcast implementation 155
Fig. 7.7 Implementation of total order broadcast (code for process p_i) 157
Fig. 7.8 To_delivery predicate of a message at process p_i 157
Fig. 7.9 Implementation of a vector clock system (code for process p_i) 160
Fig. 7.10 Time propagation in a vector clock system 161
Fig. 7.11 On the development of time (1) 164
Fig. 7.12 On the development of time (2) 164
Fig. 7.13 Associating vector dates with global states 165
Fig. 7.14 First global state satisfying a global predicate (1) 167
Fig. 7.15 First global state satisfying a global predicate (2) 168
Fig. 7.16 Detection of the first global state satisfying ∧_i LP_i (code for process p_i) 169
Fig. 7.17 Relevant events in a distributed computation 171
Fig. 7.18 Vector clock system for relevant events (code for process p_i) 171
Fig. 7.19 From relevant events to Hasse diagram 171
Fig. 7.20 … (code for process p_i) 172
Fig. 7.21 Four possible cases when updating imp_i[k], while vc_i[k] = vc[k] 173
Fig. 7.22 A specific communication pattern 175
Fig. 7.23 Specific communication pattern with n = 3 processes 175
Fig. 7.24 Management of vc_i[1..n] and kprime_i[1..n, 1..n] (code for process p_i): Algorithm 1 178
Fig. 7.25 Management of vc_i[1..n] and kprime_i[1..n, 1..n] (code for process p_i): Algorithm 2 179
Fig. 7.26 An adaptive communication layer (code for process p_i) 181
Fig. 7.27 Implementation of a k-restricted vector clock system (code for process p_i) 182
Fig. 7.28 Matrix time: an example 183
Fig. 7.29 Implementation of matrix time (code for process p_i) 184
Fig. 7.30 Discarding obsolete data: structural view (at a process p_i) 185
Fig. 7.31 A buffer management algorithm (code for process p_i) 185
Fig. 7.32 Yet another clock system (code for process p_i) 188
Fig. 8.2 A zigzag pattern 192
Fig. 8.3 … a zigzag path joining two local checkpoints of LC 194
Fig. 8.4 … a zigzag path joining two local checkpoints 195
Fig. 8.5 Domino effect (in a system of two processes) 196
Fig. 8.6 Proof by contradiction of Theorem 11 200
Fig. 8.7 … (code for p_i) 201
Fig. 8.8 To take or not to take a forced local checkpoint 202
Fig. 8.9 An example of z-cycle prevention 202
Fig. 8.10 A vector clock system for rollback-dependency trackability (code for p_i) 204
Fig. 8.11 Intervals and vector clocks for rollback-dependency trackability 204
Fig. 8.12 Russell's pattern for ensuring the RDT consistency condition 205
Fig. 8.13 Russell's checkpointing algorithm (code for p_i) 205
Fig. 8.14 FDAS checkpointing algorithm (code for p_i) 207
Fig. 8.15 Matrix causal_i[1..n, 1..n] 208
Fig. 8.16 Pure (left) vs. impure (right) causal paths from p_j to p_i 208
Fig. 8.17 An impure causal path from p_i to itself 209
Fig. 8.18 An efficient checkpointing algorithm for RDT (code for p_i) 210
Fig. 8.19 Sender-based optimistic message logging 212
Fig. 8.20 To log or not to log a message? 212
Fig. 8.21 An uncoordinated checkpointing algorithm (code for p_i) 214
Fig. 8.22 Retrieving the messages which are in transit with respect to the pair (c_i, c_j) 215
Fig. 9.2 Synchronous breadth-first traversal algorithm (code for p_i) 221
Fig. 9.3 Synchronizer: from asynchrony to logical synchrony 222
Fig. 9.4 Synchronizer α (code for p_i) 226
Fig. 9.5 Synchronizer α: possible message arrival at process p_i 227
Fig. 9.6 Synchronizer β (code for p_i) 229
Fig. 9.7 … (but not with α): case 1 229
Fig. 9.8 … (but not with α): case 2 229
Fig. 9.9 Synchronizer γ: a communication graph 230
Fig. 9.10 Synchronizer γ: a partitioning 231
Fig. 9.11 Synchronizer γ (code for p_i) 233
Fig. 9.12 Synchronizer δ (code for p_i) 235
Fig. 9.13 Initialization of physical clocks (code for p_i) 236
Fig. 9.14 The scenario to be prevented 237
Fig. 9.15 Interval during which a process can receive pulse r messages 238
Fig. 9.16 Synchronizer λ (code for p_i) 239
Fig. 9.17 Synchronizer μ (code for p_i) 240
Fig. 9.18 Clock drift with respect to reference time 241
Fig. 10.1 A mutex invocation pattern and the three states of a process 248
Fig. 10.2 Mutex module at a process p_i: structural view 250
Fig. 10.3 … (code for p_i) 251
Fig. 10.4 Proof of the safety property of the algorithm of Fig. 10.3 253
Fig. 10.5 Proof of the liveness property of the algorithm of Fig. 10.3 254
Fig. 10.6 … (code for p_i) 256
Fig. 10.7 … (code for p_i) 258
Fig. 10.8 Non-FIFO channel in the algorithm of Fig. 10.7 259
Fig. 10.9 States of the message PERMISSION({i, j}) 260
Fig. 10.10 A bounded adaptive algorithm based on individual permissions (code for p_i) 261
Fig. 10.11 Arbiter permission-based mechanism 265
Fig. 10.13 An order two projective plane 267
Fig. 10.14 A safe (but not live) mutex algorithm based on arbiter permissions (code for p_i) 269
Fig. 10.15 Permission preemption to prevent deadlock 270
Fig. 10.16 A mutex algorithm based on arbiter permissions (code for p_i) 272
Fig. 11.1 An algorithm for the multiple entries mutex problem (code for p_i) 279
Fig. 11.2 Sending pattern of NOT_USED() messages: case 1 281
Fig. 11.3 Sending pattern of NOT_USED() messages: case 2 282
Fig. 11.4 An algorithm for the k-out-of-M mutex problem (code for p_i) 282
Fig. 11.5 Examples of conflict graphs 286
Fig. 11.6 Global conflict graph 287
Fig. 11.7 A deadlock scenario involving two processes and two resources 288
Fig. 11.8 No deadlock with ordered resources 289
Fig. 11.9 A particular pattern in using resources 289
Fig. 11.10 Conflict graph for six processes, each resource being shared by two processes 290
Fig. 11.11 Optimal vertex-coloring of a resource graph 291
Fig. 11.12 Conflict graph for static sessions (SS_CG) 292
Fig. 11.13 Simultaneous requests in dynamic sessions (sketch of code for p_i) 296
Fig. 11.14 Algorithms for generalized k-out-of-M (code for p_i) 297
… (code for p_i) 299
Fig. 12.1 The causal message delivery order property 304
Fig. 12.2 The delivery pattern prevented by the empty interval property 305
Fig. 12.3 Structure of a causal message delivery implementation 307
Fig. 12.4 An implementation of causal message delivery (code for p_i) 308
Fig. 12.5 Message pattern for the proof of the causal order delivery 309
Fig. 12.6 An implementation reducing the size of control information (code for p_i) 312
Fig. 12.7 Control information carried by consecutive messages sent by p_j to p_i 312
Fig. 12.8 An adaptive sending procedure for causal message delivery 313
Fig. 12.9 Illustration of causal broadcast 313
Fig. 12.10 A simple algorithm for causal broadcast (code for p_i) 314
Fig. 12.11 The causal broadcast algorithm in action 315
Fig. 12.12 The graph of immediate predecessor messages 316
Fig. 12.13 A causal broadcast algorithm based on causal barriers (code for p_i) 317
Fig. 12.14 Message with bounded lifetime 318
Fig. 12.15 On-time versus too late 319
Fig. 12.16 A Δ-causal broadcast algorithm (code for p_i) 320
Fig. 12.17 Implementation of total order message delivery requires coordination 321
Fig. 12.18 Total order broadcast based on a coordinator process 322
Fig. 12.19 Token-based total order broadcast 323
Fig. 12.20 Clients and servers in total order broadcast 324
Fig. 12.21 A total order algorithm from clients p_i to servers q_j 326
Fig. 12.22 A total order algorithm for synchronous systems 327
… (cannot bypass other messages) 329
Fig. 12.25 Message m with type marker 329
Fig. 12.26 Building a first in first out channel 330
Fig. 12.27 Message delivery according to message types 331
… messages as "points" instead of "intervals" 336
Fig. 13.4 A crown of size k = 2 (left) and a crown of size k = 3 (right) 339
Fig. 13.5 Four message patterns 340
Fig. 13.6 Implementation of a rendezvous when the client is the sender 344
Fig. 13.7 Implementation of a rendezvous when the client is the receiver 345
Fig. 13.8 A token-based mechanism to implement an interaction 346
Fig. 13.9 Deadlock and livelock prevention in interaction implementation 347
Fig. 13.10 A general token-based implementation for planned interactions (rendezvous) 348
Fig. 13.11 An algorithm for forced interactions (rendezvous) 351
Fig. 13.12 Forced interaction: message pattern when i > j 352
Fig. 13.13 Forced interaction: message pattern when i < j 352
Fig. 13.14 … (two-process symmetric algorithm) 356
Fig. 13.15 Real-time rendezvous between two processes p and q 357
Fig. 13.16 … (asymmetric algorithm) 358
Fig. 13.17 Nondeterministic rendezvous with deadline 359
Fig. 13.18 Multirendezvous with deadline 361
Fig. 13.19 Comparing two date patterns for rendezvous with deadline 363
Fig. 14.1 Process states for termination detection 368
Fig. 14.2 Global structure of the observation modules 370
Fig. 14.3 An execution in the asynchronous atomic model 371
Fig. 14.4 One visit is not sufficient 371
Fig. 14.5 The four-counter algorithm for termination detection 372
Fig. 14.6 Two consecutive inquiries 373
Fig. 14.7 The counting vector algorithm for termination detection 374
Fig. 14.8 The counting vector algorithm at work 375
Fig. 14.9 Termination detection of a diffusing computation 378
Fig. 14.10 Ring-based implementation of a wave 380
Fig. 14.11 Spanning tree-based implementation of a wave 381
Fig. 14.12 … (∧_{1≤i≤n} idle_i^x) ⇒ TERM(C, τ_x) is not true 382
Fig. 14.13 A general algorithm for termination detection 384
Fig. 14.14 Atomicity associated with τ_i^x 384
Fig. 14.15 Structure of the channels to p_i 386
Fig. 14.16 An algorithm for static termination detection 391
Fig. 14.17 Definition of time instants for the safety of static termination 392
Fig. 14.18 Cooperation between local observers 394
Fig. 14.19 An algorithm for dynamic termination detection 395
Fig. 14.20 Example of a monotonous distributed computation 398
Fig. 15.1 Examples of wait-for graphs 402
Fig. 15.2 … in the AND communication model 410
Fig. 15.3 Determining in-transit messages 411
Fig. 15.4 … (with no application messages in transit) 411
Fig. 15.5 Time instants in the proof of the safety property 412
Fig. 15.6 A directed communication graph 414
Fig. 15.7 Network traversal with feedback on a static graph 414
Fig. 15.8 Modification in a wait-for graph 415
Fig. 15.9 Inconsistent observation of a dynamic wait-for graph 416
Fig. 15.10 An algorithm for deadlock detection in the OR communication model 418
Fig. 15.11 Activation pattern for the safety proof 420
Fig. 15.12 Another example of a wait-for graph 423
Fig. 16.1 Structure of a distributed shared memory 428
Fig. 16.2 Register: what values can be returned by read operations? 429
Fig. 16.3 The relation →_op of the computation described in Fig. 16.2 430
Fig. 16.4 An execution of an atomic register 432
Fig. 16.5 Another execution of an atomic register 432
Fig. 16.6 Atomicity allows objects to compose for free 435
Fig. 16.7 From total order broadcast to atomicity 436
Fig. 16.8 Why read operations have to be to-broadcast 437
Fig. 16.9 Invalidation-based implementation of atomicity: message flow 438
Fig. 16.10 Invalidation-based implementation of atomicity: algorithm 440
Fig. 16.11 Invalidation and owner-based implementation of atomicity (code of p_i) 441
Fig. 16.12 Invalidation and owner-based implementation of atomicity (code of the manager p_X) 442
Fig. 16.13 Update-based implementation of atomicity 443
Fig. 16.14 Update-based algorithm implementing atomicity 444
Fig. 17.1 A sequentially consistent computation (which is not atomic) 448
Fig. 17.2 A computation which is not sequentially consistent 449
Fig. 17.3 A sequentially consistent queue 449
Fig. 17.4 Sequential consistency is not a local property 450
Fig. 17.5 Part of the graph G used in the proof of Theorem 29 452
Fig. 17.6 Fast read algorithm implementing sequential consistency (code for p_i) 454
Fig. 17.7 Fast write algorithm implementing sequential consistency (code for p_i) 456
Fig. 17.8 Fast enqueue algorithm implementing a sequentially consistent queue (code for p_i) 457
Fig. 17.9 Read/write sequentially consistent registers from a central manager 458
Fig. 17.10 Pattern of read/write accesses used in the proof of Theorem 33 459
Fig. 17.11 Token-based sequentially consistent shared memory (code for p_i) 460
Fig. 17.12 Architectural view associated with the OO constraint 461
Fig. 17.13 Why the object managers must cooperate 461
Fig. 17.14 Sequential consistency with a manager per object: process side 462
Fig. 17.15 Sequential consistency with a manager per object: manager side 463
Fig. 17.17 An example of a causally consistent computation 465
Fig. 17.18 Another example of a causally consistent computation 466
Fig. 17.19 A simple algorithm implementing causal consistency 467
Fig. 17.20 Causal consistency for a single object 467
Fig. 17.21 Hierarchy of consistency conditions 469
Part I
Distributed Graph Algorithms
This first part of the book is on distributed graph algorithms. These algorithms consider the distributed system as a connected graph whose vertices are the processes (nodes) and whose edges are the communication channels. It is made up of five chapters.
After having introduced base definitions, Chap. 1 addresses network traversals. It presents distributed algorithms that realize parallel, depth-first, and breadth-first network traversals. Chapter 2 is on distributed algorithms solving classical graph problems such as shortest paths, vertex coloring, maximal independent set, and knot detection. This chapter shows that the distributed techniques to solve graph problems are not obtained by a simple extension of their sequential counterparts. Chapter 3 presents a general technique to compute a global function on a process graph, each process providing its own input parameter, and obtaining its own output (which depends on the whole set of inputs). Chapter 4 is on the leader election problem, with a strong emphasis on unidirectional and bidirectional rings. Finally, the last chapter of this part, Chap. 5, presents several algorithms that allow a mobile object to navigate a network.
In addition to the presentation of distributed graph algorithms, which can be used in distributed applications, an aim of this part of the book is to allow readers to have a better intuition of the term distributed when comparing distributed algorithms and sequential algorithms.
Chapter 1
Basic Definitions and Network Traversal Algorithms
This chapter first introduces basic definitions related to distributed algorithms. Then, considering a distributed system as a graph whose vertices are the processes and whose edges are the communication channels, it presents distributed algorithms for graph traversals, namely, parallel traversal, breadth-first traversal, and depth-first traversal. It also shows how spanning trees or rings can be constructed from these distributed graph traversal algorithms. These trees and rings can, in turn, be used to easily implement broadcast and convergecast algorithms.
As the reader will see, the distributed graph traversal techniques are different from their sequential counterparts in their underlying principles, behaviors, and complexities. This comes from the fact that, in a distributed context, the same type of traversal can usually be realized in distinct ways, each with its own tradeoff between its time complexity and message complexity.
Keywords Asynchronous/synchronous system · Breadth-first traversal · Broadcast · Convergecast · Depth-first traversal · Distributed algorithm · Forward/discard principle · Initial knowledge · Local algorithm · Parallel traversal · Spanning tree · Unidirectional logical ring
1.1 Distributed Algorithms
1.1.1 Definition
Processes A distributed system is made up of a collection of computing units, each one abstracted through the notion of a process. The processes are assumed to cooperate on a common goal, which means that they exchange information in one way or another.
The set of processes is static. It is composed of n processes and denoted Π = {p_1, …, p_n}, where each p_i, 1 ≤ i ≤ n, represents a distinct process. Each process p_i is sequential, i.e., it executes one step at a time.
The integer i denotes the index of process p_i, i.e., the way an external observer can distinguish processes. It is nearly always assumed that each process p_i has its own identity, denoted id_i; then p_i knows id_i (in a lot of cases, but not always, id_i = i).
Fig. 1.1 Three graph types of particular interest
Communication Medium The processes communicate by sending and receiving messages through channels. Each channel is assumed to be reliable (it does not create, modify, or duplicate messages).
In some cases, we assume that channels are first in first out (FIFO), which means that the messages are received in the order in which they have been sent. Each channel is assumed to be bidirectional (it can carry messages in both directions) and to have an infinite capacity (it can contain any number of messages, each of any size). In some particular cases, we will consider channels which are unidirectional (such channels carry messages in one direction only).
Each process p_i has a set of neighbors, denoted neighbors_i. According to the context, this set contains either the local identities of the channels connecting p_i to its neighbor processes or the identities of these processes.
Structural View It follows from the previous definitions that, from a structural point of view, a distributed system can be represented by a connected undirected graph G = (Π, C) (where C denotes the set of channels). Three types of graph are of particular interest (Fig. 1.1); a small code sketch of these graph types follows the list below.
can communicate directly, a left neighbor and a right neighbor
• A tree is a graph that has two noteworthy properties: it is acyclic and connected
(which means that adding a new channel would create a cycle while suppressing
a channel would disconnect it)
• A fully connected graph is a graph in which each process is directly connected to
every other process (In graph terminology, such a graph is called a clique.)
Distributed Algorithm A distributed algorithm is a collection of n automata, one
per process An automaton describes the sequence of steps executed by the sponding process
corre-In addition to the power of a Turing machine, an automaton is enriched with twocommunication operations which allows it to send a message on a channel or receive
a message on any channel The operations aresend()andreceive()
Synchronous Algorithm A distributed synchronous algorithm is an algorithm
de-signed to be executed on a synchronous distributed system The progress of such asystem is governed by an external global clock, and the processes collectively exe-
cute a sequence of rounds, each round corresponding to a value of the global clock.
Trang 33Fig 1.2 Synchronous execution (left) vs asynchronous (right) execution
During a round, a process sends at most one message to each of its neighbors The
fundamental property of a synchronous system is that a message sent by a process during a round r is received by its destination process during the very same round r Hence, when a process proceeds to the round r+ 1, it has received (and processed)
all the messages which have been sent to it during round r, and it knows that the
same is true for any process
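To make this round structure concrete, here is a minimal Python sketch of a synchronous executor (an illustration under stated assumptions, not the book's code; run_synchronous, init, and step are hypothetical names). All messages sent during round r are delivered before any process starts round r + 1.

def run_synchronous(neighbors, init, step, rounds):
    # neighbors: {pid: set of pids}; init(pid) -> initial local state;
    # step(pid, state, inbox) -> (new_state, {dest_pid: message}), called once per round
    states = {p: init(p) for p in neighbors}
    inboxes = {p: [] for p in neighbors}
    for _ in range(rounds):
        outgoing = {p: [] for p in neighbors}      # messages sent during this round
        for p in neighbors:
            states[p], to_send = step(p, states[p], inboxes[p])
            for dest, msg in to_send.items():      # at most one message per neighbor
                outgoing[dest].append((p, msg))
        inboxes = outgoing     # every round-r message is delivered before round r+1
    return states

Run for as many rounds as the network diameter, such a skeleton suffices, for instance, to flood the smallest identity to every process.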
Space/Time Diagram A distributed execution can be graphically represented by what is called a space/time diagram. Each sequential progress is represented by an arrow from left to right, and a message is represented by an arrow from the sending process to the destination process. These notions will be made more precise in Chap. 6.
The space/time diagram on the left of Fig. 1.2 represents a synchronous execution. The vertical lines are used to separate the successive rounds. During the first round, p_1 sends a message to p_3, and p_2 sends a message to p_1, etc.
Asynchronous Algorithm A distributed asynchronous algorithm is an algorithm designed to be executed on an asynchronous distributed system. In such a system, there is no notion of an external time. That is why asynchronous systems are sometimes called time-free systems.
In an asynchronous algorithm, the progress of a process is ensured by its own computation and the messages it receives. When a process receives a message, it processes the message and, according to its local algorithm, possibly sends messages to its neighbors.
A process processes one message at a time. This means that the processing of a message cannot be interrupted by the arrival of another message. When a message arrives, it is added to the input buffer of the receiving process. It will be processed after all the messages that precede it in this buffer have been processed.
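The following Python sketch (an assumption for illustration; the class and method names are hypothetical) shows this discipline: each process is a thread whose loop extracts one message at a time from its input buffer and hands it to the local algorithm.

import queue
import threading

class AsyncProcess(threading.Thread):
    # A process with an input buffer; messages are handled strictly one at a time.
    def __init__(self, pid, channels):
        super().__init__(daemon=True)
        self.pid = pid
        self.channels = channels                 # shared {pid: queue.Queue()}

    def send(self, dest, msg):                   # the send() operation
        self.channels[dest].put((self.pid, msg))

    def on_receive(self, sender, msg):           # the local algorithm goes here
        pass

    def run(self):                               # time-free receive() loop
        while True:
            sender, msg = self.channels[self.pid].get()   # blocks until a message arrives
            self.on_receive(sender, msg)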
The space/time diagram of a simple asynchronous execution is depicted on the right of Fig. 1.2. One can see that, in this example, the messages from p_1 to p_2 are not received in their sending order. Hence, the channel from p_1 to p_2 is not a FIFO (first in first out) channel. It is easy to see from the figure that a synchronous execution is more structured than an asynchronous execution.
Initial Knowledge of a Process When solving a problem in a synchronous/asynchronous system, a process is characterized by its input parameters (which are related to the problem to solve) and its initial knowledge of its environment.
This knowledge concerns its identity, the total number n of processes, the identity of its neighbors, the structure of the communication graph, etc. As an example, a process p_i may only know that:
• it is on a unidirectional ring,
• it has a left neighbor from which it can receive messages,
• it has a right neighbor to which it can send messages,
• its identity is id_i,
• the fact that no two processes have the same identity, and
• the fact that the set of identities is totally ordered.
As we can see, with such an initial knowledge, no process initially knows the total number of processes n. Learning this number requires the processes to exchange information.
1.1.2 An Introductory Example: Learning the Communication Graph
As a simple example, this section presents an asynchronous algorithm that allows each process to learn the communication graph in which it evolves. It is assumed that the channels are bidirectional and that the communication graph is connected (there is a path from any process to any other process).
Initial Knowledge Each process pi has identity idi, and no process knows n (the total number of processes). Initially, a process pi knows its identity and the identity idj of each of its neighbors. Hence, each process pi is initially provided with a set neighborsi and, for each idj ∈ neighborsi, the pair ⟨idi, idj⟩ locally denotes the channel connecting pi to pj. Let us observe that, as the channels are bidirectional, both ⟨idi, idj⟩ and ⟨idj, idi⟩ denote the same channel and are consequently considered as synonyms.
The Forward/Discard Principle The principle on which the algorithm relies is pretty simple: Each process initially sends its position in the graph to each of its neighbors. This position is represented by the pair (idi, neighborsi).

Then, when a process pi receives a pair (idk, neighborsk) for the first time, it updates its local representation of the communication graph and forwards the message it has received to all its neighbors (except the one that sent it this message). This is the "when new, forward" principle. On the contrary, if it is not the first time that pi receives the pair (idk, neighborsk), it discards it. This is the "when not new, discard" principle.
When pi has received a pair (idk, neighborsk), we say that it "knows the position" of pk in the graph. This means that it knows both the identity idk and the channels connecting pk to its neighbors.
operation start() is
(1) for each idj ∈ neighborsi
(2)    do send POSITION(idi, neighborsi) to the neighbor identified idj
(3) end for;
(4) parti ← true
end operation.

when START() is received do
(5) if (¬parti) then start() end if.

when POSITION(id, neighbors) is received from neighbor identified idx do
(6)  if (¬parti) then start() end if;
(7)  if (id ∉ proc_knowni) then
(8)      proc_knowni ← proc_knowni ∪ {id};
(9)      channels_knowni ← channels_knowni ∪ {⟨id, idk⟩ such that idk ∈ neighbors};
(10)     for each idy ∈ neighborsi \ {idx}
(11)        do send POSITION(id, neighbors) to the neighbor identified idy
(12)     end for;
(13)     if (∀ ⟨idj, idk⟩ ∈ channels_knowni : {idj, idk} ⊆ proc_knowni)
(14)        then pi knows the communication graph; return()
(15)     end if
(16)  end if.

Fig. 1.3 Learning the communication graph (code for pi)
Local Representation of the Communication Graph The graph is locally represented at each process pi with two local variables.

• The local variable proc_knowni is a set that contains all the processes whose position is known by pi. Initially, proc_knowni = {idi}.
• The local variable channels_knowni is a set that contains all the channels known by pi. Initially, channels_knowni = {⟨idi, idj⟩ such that idj ∈ neighborsi}.
Hence, after a process has received a message containing the pair (idj, neighborsj), we have idj ∈ proc_knowni and {⟨idj, idk⟩ such that idk ∈ neighborsj} ⊆ channels_knowni.
In addition to the local representation of the graph, pi has a local Boolean variable parti, initialized to false, which is set to true when pi starts participating in the algorithm.
Internal Versus External Messages The participation of a process starts when it receives an external message START() or an internal message POSITION(). An internal message is a message generated by the algorithm, while an external message is a message coming from outside. External messages are used to launch the algorithm. It is assumed that at least one process receives such a message.
Algorithm: Forward/Discard The algorithm is described in Fig. 1.3. As previously indicated, when a process pi receives a message START() or POSITION(), it starts participating in the algorithm if not yet done (line 5 or 6). To that end it sends the message POSITION(idi, neighborsi) to each of its neighbors (line 2) and sets parti to true (line 4).
When pi receives a message POSITION(id, neighbors) from one of its neighbors px for the first time (line 7), it includes the position of the corresponding process in its local data structures proc_knowni and channels_knowni (lines 8–9) and, as it has learned something new, it forwards this message POSITION() to all its neighbors but the one that sent it this message (line 10). If it has already received the message POSITION(id, neighbors) (we then have id ∈ proc_knowni), pi discards the message.
Algorithm: Termination As the communication graph is connected, it is easy to see that, as soon as a process receives a message START(), each process pi will send a message POSITION(idi, neighborsi) which, from neighbor to neighbor, will be received by each process. Consequently, for any pair of processes (pi, pj), pi will receive a message POSITION(idj, neighborsj), from which it follows that any process pi eventually learns the communication graph.
Moreover, as (a) there is a bounded number of processes n, (b) each process pi is the only process to initiate the sending of the message POSITION(idi, neighborsi), and (c) any process pj forwards this message only once, it follows that there is a finite time after which no more messages are sent. Consequently, the algorithm terminates at each process. While the algorithm always terminates, the important question is the following: When does a process know that it can stop participating in the algorithm? Trivially, a process can stop when it knows that it has learned the whole communication graph (due to the "forward" strategy, when a process knows the whole graph, it also knows that its neighbors eventually know it). This knowledge can be easily captured by a process pi with the help of its local data structures proc_knowni and channels_knowni. More precisely, remembering that the pairs ⟨idi, idj⟩ and ⟨idj, idi⟩ are synonyms and using a classical graph closure property, a process pi knows the whole graph when ∀ ⟨idj, idk⟩ ∈ channels_knowni : {idj, idk} ⊆ proc_knowni. This local termination predicate appears at line 13. When it becomes locally satisfied, a process pi learns that it knows the whole graph and also that its neighbors will eventually know it. That process can consequently stop its execution by invoking the statement return() at line 14.
It is important to notice that the simplicity of the termination predicate comes from an appropriate choice of the local data structures (proc_knowni and channels_knowni) used to represent the communication graph.
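To make the forward/discard mechanics and the termination predicate concrete, here is a minimal Python simulation of Fig. 1.3. It is a sketch under our own assumptions: the class names, the Network scheduler, and the use of frozenset pairs to encode the synonym rule ⟨idi, idj⟩ = ⟨idj, idi⟩ are all illustrative, not from the book.

from collections import deque

class Process:
    def __init__(self, ident, neighbors):
        self.ident = ident
        self.neighbors = set(neighbors)        # identities of the direct neighbors
        self.proc_known = {ident}              # processes whose position is known
        self.channels_known = {frozenset((ident, j)) for j in neighbors}
        self.part = False

    def start(self, net):                      # lines 1-4 of Fig. 1.3
        for j in self.neighbors:
            net.send(self.ident, j, (self.ident, self.neighbors))
        self.part = True

    def on_position(self, net, sender, pos):   # lines 6-16 of Fig. 1.3
        ident, neighbors = pos
        if not self.part:
            self.start(net)
        if ident not in self.proc_known:       # "when new, forward"
            self.proc_known.add(ident)
            self.channels_known |= {frozenset((ident, k)) for k in neighbors}
            for y in self.neighbors - {sender}:
                net.send(self.ident, y, pos)
            # line 13: every known channel has both endpoints in proc_known
            if all(set(c) <= self.proc_known for c in self.channels_known):
                print(f"p{self.ident} knows the whole communication graph")

class Network:
    def __init__(self, procs):
        self.procs = {p.ident: p for p in procs}
        self.queue = deque()                   # POSITION() messages in transit
    def send(self, frm, to, pos):
        self.queue.append((frm, to, pos))
    def run(self, starter):
        self.procs[starter].start(self)        # models the external START()
        while self.queue:
            frm, to, pos = self.queue.popleft()
            self.procs[to].on_position(self, frm, pos)

# A triangle {1, 2, 3} with a pendant process 4 attached to 3.
net = Network([Process(1, {2, 3}), Process(2, {1, 3}),
               Process(3, {1, 2, 4}), Process(4, {3})])
net.run(starter=1)

Each process prints its line exactly when the predicate of line 13 becomes locally true, i.e., when every channel it knows has both of its endpoints in proc_known.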
In the following, let e denote the number of channels and D the diameter of the communication graph. The diameter of a graph is the longest among all the shortest distances connecting any pair of processes, where the shortest distance between pi and pj is the smallest number of channels to go from pi to pj. The diameter is a global notion that measures the "breadth" of the communication graph.
For any i and any channel, a message POSITION(idi, −) is sent at least once and at most twice (once in each direction) on that channel. It follows that the message complexity is upper bounded by 2ne.
As far as the time complexity is concerned, let us consider that each message takes one time unit and local processing has zero duration. In the worst case, a single process pk receives a message START() and there is a process pℓ at distance D from pk. In this case, it takes D time units for a message POSITION(idk, −) to arrive at pℓ. This message wakes up pℓ, and it then takes D time units for a message POSITION(idℓ, −) to arrive at pk. It follows that the time complexity is upper bounded by 2D.
Finally, let d denote the maximal degree of the communication graph, i.e., d = max({|neighborsi|}1≤i≤n), and b the number of bits required to encode any identity idi. The maximal number of bits needed for a message POSITION() is then b(d + 1).
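As a concrete illustration (the numbers are ours, not the book's): on a bidirectional ring of n = 8 processes, we have e = 8 and D = 4, so at most 2ne = 128 POSITION() messages are exchanged and, in the worst case, the execution lasts at most 2D = 8 time units; if identities are encoded on b = 32 bits, then d = 2 and a POSITION() message carries at most b(d + 1) = 96 bits.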
iden-When Initially the Channels Have Only Local Names Let us consider a
pro-cess p i that has c i neighbors to which it is point-to-point connected by c i
chan-nels locally denoted channel i [1 c i ] When each process p i is initially given only
channel i [1 c i ], the processes can easily compute their sets neighbors i To that end,each process executes a preliminary communication phase during which it firstsends a messageID(i) on each channel i [x], 1 ≤ x ≤ c i, and then waits until it has
received the identities of the processes at the other end of its c i channels When
p i has received ID( id k ) on channel channel i [x], it can associate its local address channel i [x] with the identity id kwhose scope is the whole system
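A minimal Python sketch of this preliminary phase (the Channel class and the function names are ours, for illustration only): every process first sends its identity on each of its ports, and only then collects one identity per port.

from queue import Queue

class Channel:
    """A bidirectional channel; each endpoint has its own inbox."""
    def __init__(self):
        self.inbox = (Queue(), Queue())
    def endpoint(self, side):
        # send() writes into the peer's inbox, recv() reads the local one
        return (lambda m: self.inbox[1 - side].put(m),
                lambda: self.inbox[side].get())

def send_id(ident, ports):
    for send, _ in ports:                 # send ID(id_i) on every local port
        send(("ID", ident))

def collect_ids(ports):
    # bind each local port x to the identity id_k received on it
    return {x: recv()[1] for x, (_, recv) in enumerate(ports)}

# Two processes connected by a single channel.
c = Channel()
port1, port2 = c.endpoint(0), c.endpoint(1)
send_id("id1", [port1]); send_id("id2", [port2])
print(collect_ids([port1]))               # {0: 'id2'}
print(collect_ids([port2]))               # {0: 'id1'}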
Port Name When each channel channeli[x] is defined by a local name, the index x is sometimes called a port. Hence, a process pi has ci communication ports.
1.2 Parallel Traversal: Broadcast and Convergecast
It is assumed that, while the identity of a process pi is its index i, no process knows explicitly the value of n (i.e., pn knows that its identity is n, but does not know that its identity is also the number of processes).
1.2.1 Broadcast and Convergecast
Two frequent problems encountered in distributed computing are broadcast and convergecast. These two problems are defined with respect to a distinguished process pa.

• The broadcast problem is a one-to-many communication problem. It consists in designing an algorithm that allows the distinguished process pa to disseminate information to the whole set of processes.
A variant of the broadcast problem is the multicast problem. In that case, the distinguished process pa has to disseminate information to a subset of the processes. This subset can be statically defined or dynamically defined at the time of the multicast invocation.
when GO(data) is received from pk do
(1) if (first reception of GO(data)) then
(2)     for each j ∈ neighborsi \ {k} do send GO(data) to pj end for
(3) end if.

Fig. 1.4 A simple flooding algorithm (code for pi)
• The convergecast problem is a many-to-one communication problem. It consists in designing an algorithm that allows each process pj to send information vj to a distinguished process pa for it to compute some function f(), which is defined on the vector [v1, ..., vn] containing one value per process.
Broadcast and convergecast can be seen as dual communication operations. They are usually used as a pair: pa broadcasts a query to obtain values, one from each process, from which it computes the resulting value f(). As a simple example, pa is a process that queries sensors for temperature values, from which it computes output values (e.g., maximal, minimal, and average temperature values).
1.2.2 A Flooding Algorithm
A simple way to implement a broadcast consists of what is called a flooding algorithm. Such an algorithm is described in Fig. 1.4. To simplify the description, the distinguished process pa initially sends to itself a message denoted GO(data), which carries the information it wants to disseminate. Then, when a process pi receives a copy of this message for the first time, it forwards the message to all its neighbors (except to the sender of the message).

Each message GO(data) can be identified by a sequence number sna. Moreover, the flooding algorithm can be easily adapted to work with any number of distinguished processes by identifying each message broadcast by a distinguished process pa with an identity pair (a, sna).
As the set of processes is assumed to be connected, it is easy to see that the algorithm described in Fig. 1.4 guarantees that the information sent by a distinguished process is eventually received exactly once by each process.
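For illustration, here is a compact Python rendering of this flooding scheme, under our own naming assumptions (the queue plays the role of the messages in transit):

from collections import deque

def flood(neighbors, a, data):
    """neighbors: dict mapping each identity to the set of its neighbors."""
    delivered = set()
    queue = deque([(a, a, data)])          # pa sends GO(data) to itself
    msg_count = 0
    while queue:
        k, i, d = queue.popleft()          # GO(d) sent by pk, received by pi
        msg_count += 1
        if i not in delivered:             # line 1: first reception of GO(data)
            delivered.add(i)
            for j in neighbors[i] - {k}:   # line 2: forward to all but the sender
                queue.append((i, j, d))
    return delivered, msg_count

# Triangle {1, 2, 3} plus a pendant process 4: every process is reached.
print(flood({1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}, a=1, data="query"))

Running it shows that every process is reached while some messages are received and then discarded, which is precisely the inefficiency the tree-based approach of the next section removes.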
1.2.3 Broadcast/Convergecast Based on a Rooted Spanning Tree
The previous flooding algorithm may use up to 2e − |neighborsa| messages (where e is the number of channels), and is consequently not very efficient. A simple way to improve it consists of using an underlying spanning tree rooted at the distinguished process pa.
Fig. 1.5 A rooted spanning tree
Rooted Spanning Tree A spanning tree rooted at pa is a tree which contains the n processes and whose channels (edges) are channels of the communication graph. Each process pi has a single parent, locally denoted parenti, and a (possibly empty) set of children, locally denoted childreni. To simplify the notation, the parent of the root is the root itself, i.e., the distinguished process pa is the only process pi such that parenti = i. Moreover, if j ≠ a, we have j ∈ childreni ⇔ parentj = i, and the channel ⟨i, j⟩ belongs to the communication graph.
An example of a rooted spanning tree is described in Fig. 1.5. The arrows (oriented toward the root) describe the channels of the communication graph that belong to the spanning tree. The dotted edges are the channels of the communication graph that do not belong to the spanning tree. This spanning tree rooted at pa is such that, when considering the position of process pi where neighborsi = {a, k, j, f}, we have parenti = a, childreni = {j, k}, and consequently parentj = parentk = i. Moreover, childreni ∪ {parenti} ⊆ neighborsi = {a, k, j, f}.
Algorithms Given such a rooted spanning tree, the algorithms implementing a broadcast by pa and the associated convergecast to pa are described in Fig. 1.6. As far as the broadcast is concerned, pa first sends the message GO(data) to itself, and then this message is forwarded along the channels of the spanning tree; this restricted flooding stops at the leaves of the tree.
As far as the convergecast is concerned, each leaf pi sends a message BACK(val_seti) to its parent (line 4), where val_seti = {(i, vi)} (line 2), i.e., val_seti contains a single pair carrying the value vi sent by pi to the root.
============= Broadcast =============================
when GO(data) is received from pk do
(1) for each j ∈ childreni \ {k} do send GO(data) to pj end for.

============= Convergecast ===========================
when BACK(val_setj) is received from each pj such that j ∈ childreni do
(2) val_seti ← (∪ j∈childreni val_setj) ∪ {(i, vi)};
(3) let k = parenti;
(4) if (k ≠ i) then send BACK(val_seti) to pk
(5)          else the root pi (= pa) can compute f(val_seti)
(6) end if.

Fig. 1.6 Tree-based broadcast/convergecast (code for pi)
Trang 40single pair carrying the value v i sent by p i to the root A non-leaf process p i waits
for the pairs (k, v k ) from all its children, adds its own pair (i, v i ), and finally sends
the resulting set val_set i to its parent (line4) When the root has received a set ofpairs from each of its children, it has a pair from each process and can compute the
function f () (line5)
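When the whole tree is visible, the convergecast is easy to phrase recursively, as in this small Python sketch (the names are ours; a real execution is of course message-driven, as in Fig. 1.6):

def convergecast(i, children, value):
    """Return val_set_i: the pairs (j, v_j) of the subtree rooted at p_i."""
    val_set = {(i, value[i])}                 # line 2: add the own pair (i, v_i)
    for j in children[i]:                     # one BACK(val_set_j) per child
        val_set |= convergecast(j, children, value)
    return val_set

# Spanning tree rooted at p_1, and one input value per process.
children = {1: {2, 3}, 2: {4}, 3: set(), 4: set()}
value = {1: 17, 2: 5, 3: 42, 4: 8}
val_set = convergecast(1, children, value)
print(sorted(val_set), max(v for _, v in val_set))  # the root computes f(), e.g., max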
1.2.4 Building a Spanning Tree
This section presents a simple algorithm that (a) implements broadcast and convergecast, and (b) builds a spanning tree. This algorithm is sometimes called propagation of information with feedback. Once a spanning tree has been constructed, it can be used for future broadcasts and convergecasts involving the same distinguished process pa.
Local Variables As before, each process pi is provided with a set neighborsi which defines its position in the communication graph and, at the end of the execution, its local variables parenti and childreni will define its position in the spanning tree rooted at pa.

To compute its position in the spanning tree rooted at pa, each process pi uses an auxiliary integer local variable denoted expected_msgi. This variable contains the number of messages that pi is waiting for from its children before sending a message BACK() to its parent.
Algorithm The broadcast/convergecast algorithm building a spanning tree is described in Fig. 1.7. To simplify the presentation, it is first assumed that the channels are FIFO (first in, first out). The distinguished process pa is the only process which receives the external message START() (line 1). Upon its reception, pa initializes parenta, childrena, and expected_msga and sends a message GO(data) to each of its neighbors (line 2).
When a process pi receives a message GO(data) for the first time, it defines the sender pj as its parent in the spanning tree, and initializes childreni to ∅ and expected_msgi to the number of its neighbors other than pj (line 4). If its parent is its only neighbor, it sends back the pair (i, vi), thereby indicating to pj that it is one of its children (lines 5–6). Otherwise, pi forwards the message GO(data) to all its neighbors but its parent pj (line 7).
If parenti ≠ ⊥ when pi receives GO(data), it has already determined its parent in the spanning tree and forwarded the message GO(data). It consequently sends by return to pj the message BACK(∅), where ∅ is used to indicate to pj that pi is not one of its children (line 9).
When a process pi receives a message BACK(res, val_set) from a neighbor pj, it decreases expected_msgi (line 11) and adds pj to childreni if val_set ≠ ∅ (line 12). Then, if pi has received a message BACK() from all its neighbors (but its parent, line 13), it sends to its parent (lines 15–16) the set val_set containing its own pair (i, vi) together with the pairs it has received from its children.
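Under the same assumptions, here is a compact Python simulation of this propagation of information with feedback (all names are ours and the BACK() message is simplified to carry only val_set; Fig. 1.7 remains the authoritative pseudocode). Each process records its parent, its children, and the number of BACK() messages it still expects:

from collections import deque

def build_spanning_tree(neighbors, a, value):
    parent = {i: None for i in neighbors}          # parent_i, initially ⊥
    children = {i: set() for i in neighbors}
    expected = {}                                  # expected_msg_i
    queue = deque([("GO", a, a)])                  # models the external START()
    result = {}

    def send_back(i):                              # BACK() travels toward the root
        val_set = {(i, value[i])}
        for c in children[i]:
            val_set |= result[c]
        result[i] = val_set
        if parent[i] != i:                         # the root keeps the final set
            queue.append(("BACK", i, parent[i], val_set))

    while queue:
        msg = queue.popleft()
        if msg[0] == "GO":
            _, j, i = msg
            if parent[i] is None:                  # first GO(): p_j becomes parent
                parent[i] = j
                expected[i] = len(neighbors[i] - {j})
                if expected[i] == 0:
                    send_back(i)                   # a leaf answers immediately
                else:
                    for y in neighbors[i] - {j}:
                        queue.append(("GO", i, y))
            else:                                  # parent already known
                queue.append(("BACK", i, j, set()))  # ∅: "I am not your child"
        else:
            _, j, i, val_set = msg
            expected[i] -= 1
            if val_set:
                children[i].add(j)
            if expected[i] == 0:
                send_back(i)
    return parent, children, result[a]             # the root holds all the pairs

neighbors = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
value = {1: 17, 2: 5, 3: 42, 4: 8}
print(build_spanning_tree(neighbors, a=1, value=value))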
to arrive at p This message wakes up p