

Checkpointing Techniques for Minimizing the Waiting Time during Debugging Long-Running Parallel Programs

Johannes Kepler Universität Linz

submitted to

o.Univ.-Prof. Dr. Jens Volkert (advisor), a.Univ.-Prof. Dr. Josef Küng (second examiner)

Linz, July 2003


Acknowledgements

First of all I would like to thank my advisor, Prof. Dr. Jens Volkert, who provided me with a good working environment and gave me considerable help in finishing this thesis. I am also grateful to Dr. Dieter Kranzlmüller, whose expert advice, particularly on the ideas and implementation issues, was indispensable. I shall never forget the discussion at the Christmas party in 2000 that led to many ideas in this thesis. I am grateful to both gentlemen for their comments on this thesis, and also for their contributions to the papers that I published while at the University of Linz.

I would like to thank Dr. Nguyen Thanh Son, who was the advisor on my engineering thesis. He taught me a lot when I was an undergraduate student, as well as when I worked in his research group. He gave me the basic knowledge to finish this work.

My research was supported by a scholarship from the Austrian Exchange Service (OEAD). I would like to thank the OEAD, especially the office in Linz, for their support. I am thankful to my colleagues at GUP Linz and my close friends for their help and discussions.

I dedicate this work to my grandmothers, my parents, my aunts and my uncles (especially my youngest aunt), and the other members of my large family, as an expression of love and gratitude for providing me with the best education. Without their love, care and encouragement I would not have finished this thesis. I also want to dedicate this work to my deceased grandfathers on account of their love.

I am grateful to my brothers and sisters for their emails and chats; they stopped me from feeling alone here. I am also grateful to my girlfriend, who was an endless source of inspiration and motivation that helped me finish this work.


Abstract

Applications of computational science and engineering are often large-scale, long-running parallel programs running on High-Performance Computing (HPC) platforms. Debugging these applications is difficult for several reasons: long execution time, the possibly large amounts of data involved, and nondeterministic program behavior. The waiting time until a debugger arrives at a certain position during cyclic debugging may be rather long due to the long execution time. Large amounts of data are involved because lots of state information must be stored and analysed. Nondeterminism of parallel programs causes subsequent re-executions with the same input data to generate different results.

The Shortcut Replay method addresses reproducibility and reduces the waiting time. While other replay methods cannot guarantee a short waiting time for arrival at any position during repeated debugging cycles, Shortcut Replay is a record&replay method in which techniques of checkpointing and bypassing orphan messages are used to establish flexible recovery lines. The R2-graph and R2-path (and DR2-path) are introduced to find suitable recovery lines for use in Shortcut Replay.

The difficulty is to achieve a satisfactory trade-off between short waiting times and the message logging overhead. If the most recent recovery line of any distributed breakpoint is used, the waiting time is obviously short, but the overhead of message logging may be too high, because many messages must be logged. The Rollback-One-Step method (ROS) and the Enhancement of ROS (EROS) provide a solution: they keep the upper bound of the waiting time to a low value with a minimum message logging overhead. The amount of debugging data can be reduced by using Process Isolation, in which only a small group of processes is isolated and debugged. ROS can be applied to reduce the waiting time in Process Isolation, but the Minimizing the Replay Time method (MRT) is a better solution in this case. The four-phase-replay method is proposed to avoid the probe effect when using MRT in Process Isolation.

The methods described in this thesis are intended to make cyclic debugging feasible for large-scale, long-running parallel programs. Combining these methods with interactive debugging tools provides a short waiting time, which is a key requirement for the user's investigations.

Kurzfassung

Application programs in computer science and the technical sciences are often large, long-running parallel programs for message-passing high-performance computers. Debugging such applications is difficult for several reasons: long execution times, the large amounts of data to be processed, and the mostly nondeterministic behavior of such programs. The long execution times mean that, in cyclic debugging, the period until a desired position is reached can be rather long. The large amounts of data force the storage of much state information, which moreover must be analysed. Nondeterminism gives rise to the phenomenon that repeated executions of the program with the same input data can deliver different results.

The Shortcut Replay method developed in this thesis ensures identical repeatability and reduces the waiting times during debugging. While other replay methods cannot guarantee short waiting times until a given position is reached in repeated debugging cycles, Shortcut Replay represents a method in which flexible recovery lines are created by means of checkpoints and by bypassing so-called orphan messages. Suitable recovery lines are constructed using the R2-graphs and R2-paths (and DR2-paths) developed in this thesis.

One difficulty of this approach is finding a suitable balance between the waiting time and the amount of messages that must be stored. After a program is halted, restarting at the recovery line closest to such a global breakpoint obviously causes the shortest waiting time, but usually a large number of messages must be stored. Solutions are offered by the Rollback-One-Step method (ROS) and its enhancement (EROS); these guarantee an upper bound on the waiting time with relatively low storage requirements.

To reduce the amount of data to be examined, the technique of process isolation is used, in which debugging is restricted to a small group of processes. ROS can likewise be used in this context to shorten the waiting times, but in this case the method "Minimizing the Replay Time (MRT)" proves better. The probe effect that usually arises here (the influence of the monitoring software) is avoided by the newly developed four-phase-replay method.

The methods described in this thesis aim to make cyclic debugging feasible for large, long-running parallel programs. In combination with interactive debugging tools, only short waiting times are necessary, which is a basic prerequisite for investigations by the user.


Table of Contents

Chapter 1 Introduction 17

Chapter 2 Parallel Applications and Debugging Parallel Programs 21

2.1 Applications in High-Performance Computing 22

2.1.1 Grand Challenge Problems 22

2.1.2 Demand for Parallel Processing 22

2.1.3 Parallel Computer Architecture 23

2.1.4 High Performance Computing 26

2.2 Testing & Debugging 28

2.2.1 Testing 28

2.2.2 Debugging 29

2.2.3 Breakpoint 30

2.2.4 Cyclic debugging 32

2.3 Debugging Parallel Programs 34

2.3.1 Nondeterministic Behavior of Parallel Programs 35

2.3.2 Race Conditions 36

2.3.3 The Irreproducibility Effect 38

2.3.4 The Completeness Problem 38

2.3.5 The Probe Effect 38

2.4 Replay Methods 39

2.5 Summary and Comments 40

Chapter 3 Event Graph Model 42

3.1 Events 43

3.1.1 Program States 43

3.1.2 Event Definition 43

3.1.3 Classification and Characteristics of Events 44

3.2 Event-Ordering Relations 45

3.2.1 Event Ordering 45

3.2.2 The “Happened-Before” Relation 45

3.3 The Event Graph Model 46

3.4 A Simple Event Graph Model 47

3.4.1 Events of Interest 47

3.4.2 Event Record Structure 49

3.4.3 Partial Event Ordering 50

3.5 Summary 51

Chapter 4 Checkpointing & Rollback Recovery 52

4.1 What is Checkpointing? 53

4.2 Checkpointing for Parallel Programs 53

4.2.1 Background and Definitions 53

4.2.2 Consistent Global States 56

4.2.3 Z-Paths and Z-Cycles 56

4.2.4 Recovery Lines and Domino Effects 57

4.3 Interactions with the Outside World 58

4.4 Log-Based Rollback Recovery 58


4.4.1 The No-Orphans Consistency Condition 60

4.4.2 Pessimistic Log-Based Rollback Recovery (Pessimistic Logging) 60

4.4.3 Optimistic Log-Based Rollback-Recovery (Optimistic Logging) 61

4.4.4 Causal Log-Based Rollback-Recovery (Causal Logging) 63

4.5 Checkpoint-Based Rollback Recovery 64

4.5.1 Uncoordinated Checkpointing 64

4.5.2 Coordinated Checkpointing 66

4.5.3 Communication-Induced Checkpointing 67

4.6 Applications of Checkpointing in Other Areas 68

4.7 Checkpointing in Debugging 68

4.7.1 The Use of Checkpointing in Debugging 69

4.7.2 Differing Requirements for Checkpointing in Fault-Tolerant Computing and in Debugging 69

4.8 Summary 71

Chapter 5 OMpiClib: Checkpoint Library for MPI Programs 72

5.1 Process Image 73

5.1.1 Text 73

5.1.2 Data 73

5.1.3 Stack 74

5.1.4 Shared Libraries 74

5.1.5 Files 74

5.1.6 Signals 74

5.1.7 Processor State 74

5.2 Checkpointing Single Process 74

5.3 Checkpointing for MPI Programs: Related Work 76

5.4 OMpiClib 77

5.4.1 Process Connections 78

5.4.2 Checkpointing Data 79

5.4.3 Recovery Process 80

5.5 Summary 80

Chapter 6 Shortcut Replay 81

6.1 Rollback/Replay Distance 82

6.2 Unlimited Rollback Distance with Checkpoint-Based Rollback Recovery 83

6.3 Shortcut Replay 85

6.3.1 Bypassing Orphan Messages 85

6.3.2 Record Phase 86

6.3.3 Two Stages in Replay Phase 86

6.3.4 Detecting Orphan Messages 87

6.4 Summary 89

Chapter 7 Methods to Reduce Logging Overhead When Minimizing the Waiting Time during Debugging 90

7.1 Message Logging in Debugging 91

7.2 ROS: Rollback-One-Step Method 92

7.2.1 Checkpointing 92

7.2.2 Message Logging 93

7.2.3 Rollback Distance 94

7.2.4 The Stronger Condition 97

7.2.5 Experience 100


7.3 EROS: Enhancement of ROS 100

7.3.1 Replay Time of ROS and Improvement 101

7.3.2 Replay Dependence Relation 102

7.3.3 Checkpointing, Message Logging and Replay Time 104

7.3.4 Information Collection 106

7.3.5 Experience 106

7.4 Summary 107

Chapter 8 Recovery Lines with R2-graph 109

8.1 Available Recovery Lines 110

8.2 R2-graphs 110

8.3 R2-path and DR2-path 111

8.4 Constructing a Recovery Line 112

8.5 Summary 114

Chapter 9 Process Isolation and Replay Method 115

9.1 Problems with Large-Scale Parallel Programs 116

9.2 Process Isolation based on an Event Graph 116

9.3 Replay Method in Process Isolation 117

9.4 Reducing the Waiting Time with ROS 118

9.5 MRT – Minimizing the Replay Time Method 119

9.5.1 Checkpointing 120

9.5.2 Message Logging 120

9.5.3 Replay Time 121

9.5.4 The Four-Phase-Replay Method to Avoid Probe Effect 122

9.5.5 Experience 123

9.6 Summary 125

Chapter 10 Future Work 126

10.1 Minimizing the Waiting Time 127

10.1.1 A Good Checkpointing Method 127

10.1.2 A Flexible Recovery Line Based on a Group of Processes 128

10.2 Automatic Testing with Checkpointing 129

10.3 Supporting Group Communication 129

Chapter 11 Summary and Conclusions 131

Chapter 12 References 135


Symbols, Variables and Constants

|| concurrent relation 46

→f replay dependence relation 102, 103

→ “happened-before” relation 45

Φ set of specification of a program 28

Φk a specification of a program 28

ANCE set of immediate ancestors 83

Bk breakpoint on process Pk 84

(B0, B1, …, Bn-1) distributed breakpoint 94

Cp,i i-th checkpoint on process Pp 54

Cp,0 initial checkpoint on process Pp 54

(C0,k0, C1,k1, …, Cn-1,k(n-1)) global checkpoint 54

crit(e) critical time of event e 83

ckpt checkpoint event 47

→C concurrent order 46

d(e1, e2) distance between event e1 and event e2 on the same process 82

D(G1, G2) replay/rollback distance between global state G1 and global state G2 82

Depend(e) set of processes affected by nondeterministic event e 60

DO-event Dangerous Output events 85

DR2-path R2-path for only two processes 111

→DR2 DR2-path 111

E set of events e 46

ep,i i-th event on process Pp 45

EROS Enhancement of ROS 100

FIFO First-In-First-Out 67

FUTURE(G) time zone for a given global state G 55

G global state 55

event graph 46

group isolated group 116

head(m) an end point of path m in R2-graph 111

Isend nonblocking send event 47

Ip,i i-th checkpoint interval on process Pp 54

LPs set of logical processes used in simulation 68

Log(e) set of processes that have logged e’s determinant 60

m (or mk) message 54

MI[][] knowledge matrix 98

MRT Minimizing the Replay Time method 119

message(e) the message which is created by send event e 117

n number of processes 54

O-event output event 85

origin(m) the process that sends message m 117

P q process q 54

PAG program activity graph 83

PAST(G) time zone for a given global state G 55

PC program counter 73

PCs personal computers 27

PWD piecewise deterministic 39

R2-graph Rollback Replay graph 111


R2-path a path in R2-graph 111

receive blocking receive event 47

receive(m) receive event of message m 55

RIS(e) replay dependence set of event e 103

RIS(I) replay dependence set of checkpoint interval I 103

RLSk(e) replay layer dependence set k of event e 104

RLSk(I) replay layer dependence set k of checkpoint interval I 103

ROS Rollback-One-Step method 92

RS(e) real replay dependence set of event e 105

RV[] checkpoint interval index vector used in recovery process 88

→R2 R2-path 111

send blocking send event 47

Ssend blocking synchronous send event 47

send(m) send event of message m 55

SP stack pointer 73

Stable(e) a predicate for logging of e’s determinant 60

SV[] checkpoint interval index vector 88

→S sequential order 45

T maximum execution time of all checkpoint intervals 54

T(e’,e) weight of the directed edge from e’ to e in PAG 83

tk time since the last checkpoint 101

TD[] vector used in transitive dependency tracking 62

tail(m) an end point of path m in R2-graph 111

test nonblocking receive event 47

type(e) type of event e 117

Z-path zigzag path 56


List of Figures

Chapter 1 Introduction 17

Chapter 2 Parallel Applications and Debugging Parallel Programs 21

Figure 2-1 Shared-Memory Multiprocessors 23

Figure 2-2 Distributed-Memory Multicomputers 24

Figure 2-3 Distributed Shared-Memory System 25

Figure 2-4 Overlapped design space of clusters, MPPs, SMPs and distributed computer systems [HwXu 98] 26

Figure 2-5 Classification of breakpoints [Kacs 00] 31

Figure 2-6 Breakpoint in cyclic debugging [Kran 00a] 33

Figure 2-7 Race in shared-memory programs 36

Figure 2-8 Race condition in message-passing programs 37

Chapter 3 Event Graph Model 42

Chapter 4 Checkpointing & Rollback Recovery 52

Figure 4-1 Message-passing programs 54

Figure 4-2 Checkpoints and transferred messages 54

Figure 4-3 PAST and FUTURE of a global checkpoint 55

Figure 4-4 Z-Paths and Z-Cycles 57

Figure 4-5 Domino effect 58

Figure 4-6 Pessimistic logging 60

Figure 4-7 Optimistic logging 62

Figure 4-8 Exponential rollbacks 62

Figure 4-9 Causal Logging 64

Figure 4-10 (a) Program execution; (b) Rollback-dependency graph; (c) Checkpoint graph 65

Figure 4-11 Non-blocking checkpoint coordination 67

Chapter 5 OMpiClib: Checkpoint Library for MPI Programs 72

Figure 5-1 Address space of a process 73

Figure 5-2 Socket connections between the MPI Daemon and other MPI processes: (a) Initial execution; (b) After recovery 78

Figure 5-3 Address space of an MPI process 79

Chapter 6 Shortcut Replay 81

Figure 6-1 Example for computing rollback/replay distance 82

Figure 6-2 Consistent and inconsistent recovery lines 84

Figure 6-3 Two stages in a replay process 84

Figure 6-4 Trace data of nondeterministic events 85

Figure 6-5 Method of detecting orphan messages 87

Figure 6-6 Identifying orphan messages in FIFO-computation 88

Chapter 7 Methods to Reduce Logging Overhead When Minimizing the Waiting Time during Debugging 90

Figure 7-1 Forcing a process to take a checkpoint immediately 92

Figure 7-2 Message logging in ROS 93


Figure 7-3 A set of rollback processes 96

Figure 7-4 Knowledge matrix 97

Figure 7-5 Percentage of number of logged messages per number of total messages 99

Figure 7-6 Waiting for a message in re-execution 100

Figure 7-7 Waiting messages and replay time 100

Figure 7-8 The worst case during re-execution with ROS 101

Figure 7-9 Replay dependence 102

Figure 7-10 The waiting chain of processes 104

Figure 7-11 Percentage of number of logged messages per total number of messages 107

Chapter 8 Recovery Lines with R2-graph 109

Figure 8-1 (a) Checkpoints and superfluous messages; (b) R2-graph 110

Figure 8-2 R2-path and DR-path in R2-graph 111

Chapter 9 Process Isolation and Replay Method 115

Figure 9-1 Sample event graph for a large-scale parallel program 116

Figure 9-2 ROS in process isolation 118

Figure 9-3 Dependence in replaying 121

Figure 9-4 The four-phase-replay method 122

Figure 9-5 Percentage of number of logged messages per number of total messages 124

Chapter 10 Future Work 126

Figure 10-1 Placing checkpoints 127

Figure 10-2 Checkpointing in testing 128

Figure 10-3 Group communication problem 129

Chapter 11 Summary and Conclusions 131

Chapter 12 References 135


List of Tables

Chapter 1 Introduction 17

Chapter 2 Parallel Applications and Debugging Parallel Programs 21

Chapter 3 Event Graph Model 42

Table 3-1: Event types and event data [Kran 00a] 44

Table 3-2: Sample event record structure for MPI_Recv [Kran 00a] 49

Table 3-3: Example event record structure for checkpoint event 50

Chapter 4 Checkpointing & Rollback Recovery 52

Table 4-1 Comparison between different styles of rollback-recovery techniques [ElJo 99] 70

Chapter 5 OMpiClib: Checkpoint Library for MPI Programs 72

Table 5-1 Comparison of various checkpointing systems 77

Chapter 6 Shortcut Replay 81

Chapter 7 Methods to Reduce Logging Overhead When Minimizing the Waiting Time during Debugging 90

Table 7-1 Message logging overhead 99

Table 7-2 Message logging overhead of EROS & ROS 107

Chapter 8 Recovery Lines with R2-graph 109

Chapter 9 Process Isolation and Replay Method 115

Table 9-1 Message logging overhead 124

Chapter 10 Future Work 126

Chapter 11 Summary and Conclusions 131

Chapter 12 References 135


Chapter 1 Introduction


High-Performance Computing (HPC) technologies have been substantially improved in terms of usability and programmability over the last decade. The latest approaches to low-cost HPC on clusters of workstations on the one hand, and high-performance, high-throughput computing on computational grids on the other [BaLa 02], have further pushed HPC technology into the mainstream market. This fact is underlined by many scientific applications that either have been or are being developed to run on PC clusters or the Grid [Grid 01] [Grid 02] [Grid 03] [Grid 04] [Grid 05] [Grid 06]. However, there are still some problems which are not sufficiently solved. One of them is debugging long-running parallel programs, which is especially important in the context of HPC platforms, because many applications in computational science and engineering involve large-scale processes and a long execution time: days, weeks, or even months.

Debugging is an essential part of software development, but it is difficult and tedious. The aim is to detect and correct critical errors in order to improve the reliability of the code. At present, a traditional and popular method is cyclic debugging, in which a program is executed repeatedly in order to examine program states as they occur and finally to detect the origins of erroneous behavior. Debugging is difficult anyway, but debugging parallel programs is even more difficult and complex. The main reason is the nondeterministic behavior of parallel programs. In this case, consecutive runs of a program may result in different executions, even if the same input data are provided. This so-called irreproducibility effect causes serious problems for cyclic debugging, because subsequent executions of a program may not reproduce the original bugs.

To solve the above problem, deterministic replay methods or record&replay mechanisms have been developed [LeMe 87] [PaLi 89] [NeMi 92b] [ChBo 99] [KrVo 99a]. During an initial record phase, a trace of the program's nondeterministic choices is recorded, which may afterwards be used to ensure that a re-execution is equivalent to the initially observed execution. Despite these efforts in combining cyclic debugging and deterministic replay, debugging long-running parallel programs is still a challenge, since execution times of days, weeks, or even months are quite common. Every time the programmer wants to inspect the value of a variable at an intermediate point, the program must be re-executed from the beginning to that point. Since cyclic debugging usually requires several re-execution cycles, the debugging task may become very time-consuming.
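The record&replay idea sketched above can be illustrated with a toy example (this is not the thesis implementation; all names here are invented). The record phase logs the outcome of every nondeterministic choice point, and the replay phase forces the same outcomes, so the re-execution is equivalent to the recorded run:

```python
import random

def run(choose):
    # a "program" with three nondeterministic points (e.g. wildcard receives)
    return [choose(["A", "B", "C"]) for _ in range(3)]

# --- record phase: execute once, logging every nondeterministic outcome ---
trace = []
rng = random.Random()

def record_choose(options):
    outcome = rng.choice(options)   # the nondeterministic decision
    trace.append(outcome)           # the trace is the only data kept
    return outcome

first_run = run(record_choose)

# --- replay phase: force each choice point to its recorded outcome ---
replay_iter = iter(trace)

def replay_choose(options):
    return next(replay_iter)        # deterministic re-execution

assert run(replay_choose) == first_run
```

The key property is that only the *outcomes* of nondeterministic events need to be traced, not the whole program state.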

This problem can be addressed by reducing the waiting time with a combined checkpointing and debugging approach [NeXu 93]. Checkpointing is a technique in which the state of a running program is saved from time to time during execution. These intermediate states are used to generate recovery lines, which may afterwards serve as restarting points of a program's execution [Plan 97]. In other words, with this combined approach it is no longer necessary to replay the target program from the beginning. While the application of checkpointing to other fields such as fault-tolerant computing, distributed simulation, etc. has been studied in depth, the combination of checkpointing and debugging has not been studied sufficiently. Research into this combination is needed to satisfy the requirements of debugging tools. Several surveys, e.g. [PaCo 94], show that a principal reason for tools not being widely used is lack of user-friendliness. One characteristic of user-friendliness is interactivity: when users apply an operation, they usually expect an immediate response. But the waiting time may be rather long in some cases, due to limitations of current deterministic replay techniques, especially with long-running parallel programs.
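Checkpointing proper (as in Libckpt or OMpiClib, Chapter 5) saves the whole process image; the following minimal sketch saves only explicit Python state, but it shows the restart-point idea: a re-execution resumes from the most recent checkpoint instead of from the beginning. All names here are invented for illustration.

```python
import os
import pickle
import tempfile

CKPT_DIR = tempfile.mkdtemp()   # stable storage for this toy example

def save_checkpoint(step, state):
    """Persist the program state so a later run can resume here."""
    path = os.path.join(CKPT_DIR, f"ckpt_{step:06d}.pkl")
    with open(path, "wb") as f:
        pickle.dump((step, state), f)

def restore_latest():
    """Return (step, state) of the most recent checkpoint, or a fresh start."""
    files = sorted(os.listdir(CKPT_DIR))
    if not files:
        return 0, {"acc": 0}
    with open(os.path.join(CKPT_DIR, files[-1]), "rb") as f:
        return pickle.load(f)

def run(until, ckpt_every=10):
    step, state = restore_latest()   # restart point: skip finished work
    while step < until:
        state["acc"] += step         # the "computation"
        step += 1
        if step % ckpt_every == 0:
            save_checkpoint(step, state)
    return state["acc"]

print(run(25))   # initial run; checkpoints are taken at steps 10 and 20
```

A subsequent debugging cycle that needs to reach step 25 would restart from the step-20 checkpoint rather than from step 0.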

This thesis contributes a solution to the above problems. We show that the main obstacle is the restriction in the way current replay methods establish a recovery line. Consequently, a new replay method called Shortcut Replay has been developed. It supports message-passing programs with point-to-point communication. Shortcut Replay uses an event graph as the model of the program's execution. It consists of two phases: (1) record and (2) replay. In the record phase, only a small amount of significant information is traced. In the replay phase, this information is used to control re-execution. In particular, any orphan messages are skipped during re-execution. This characteristic allows Shortcut Replay to choose a recovery line flexibly, so that the rollback/replay distance is shorter than in other methods. The rollback/replay distance is the gap between the recovery line and the corresponding distributed breakpoint. The replay time is short due to the short replay distance.
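The notions of recovery line, orphan message, and rollback/replay distance can be made concrete with a small sketch (an invented data layout, not the thesis code). Events are (process, index) pairs; the recovery line gives, per process, the index of the first event to be re-executed:

```python
def classify(msg, recovery_line):
    """A message is 'orphan' if its send is replayed but its receive is not:
    during re-execution it would arrive a second time, so Shortcut Replay
    bypasses (discards) it. It is 'in-transit' if its receive is replayed
    but its send is not: it must have been logged so replay can deliver it."""
    sp, si = msg["send"]
    rp, ri = msg["recv"]
    send_replayed = si >= recovery_line[sp]
    recv_replayed = ri >= recovery_line[rp]
    if send_replayed and not recv_replayed:
        return "orphan"        # bypassed during replay
    if recv_replayed and not send_replayed:
        return "in-transit"    # must come from the message log
    return "normal"

def replay_distance(recovery_line, breakpoint_):
    """Rollback/replay distance: events to re-execute per process, summed."""
    return sum(breakpoint_[p] - recovery_line[p] for p in recovery_line)

recovery_line = {0: 4, 1: 2}
breakpoint_   = {0: 6, 1: 5}
msgs = [
    {"send": (0, 5), "recv": (1, 1)},   # sent after the line, received before
    {"send": (1, 1), "recv": (0, 5)},   # sent before the line, received after
]
print(classify(msgs[0], recovery_line))                  # orphan
print(classify(msgs[1], recovery_line))                  # in-transit
print(replay_distance(recovery_line, breakpoint_))       # (6-4) + (5-2) = 5
```

Because orphan messages are simply bypassed, the recovery line need not be consistent in the classical sense, which is what gives Shortcut Replay its flexibility.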

In addition, the overhead of combining checkpointing and debugging for parallel programs is also studied. There are two main overheads: (1) the checkpointing data and (2) the storage of transferred messages. While incremental checkpointing [FeBr 89] [NaLe 97] is available as an efficient method to reduce the amount of checkpointing data, minimizing the storage of transferred messages is still a challenge. Although some methods in this field are used in fault-tolerant computing, they cannot be applied to debugging, due to the new construction of the recovery line in Shortcut Replay. Consequently, two new methods named Rollback-One-Step (ROS) and Enhancement of ROS (EROS) are developed. Based on the relation between checkpoint events, ROS decides whether incoming messages are logged in stable storage or not. It ensures that the upper bound of the rollback/replay distance is 2T, where T is the maximum execution time of all checkpoint intervals during the initial execution. As an improvement on ROS, EROS uses the relation between intervals to decide whether to store incoming messages or not, in order to ensure an upper bound of 3T for the replay time. The overhead due to message logging in both methods is small: experience with some sample programs shows that the number of logged messages is often less than 5% of the total number of messages.
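A hedged sketch of the Rollback-One-Step intuition follows (the actual ROS decision in Chapter 7 is taken online using a knowledge matrix; the data layout here is invented). If every process restarts at most one checkpoint interval before its breakpoint interval, replay covers at most two intervals, i.e. at most 2T where T is the longest checkpoint interval; any message sent before the sender's one-step-back restart point cannot be regenerated during such a replay and must therefore be logged:

```python
def restart_interval(bp_interval):
    """Roll back one step: restart one interval before the breakpoint interval."""
    return max(0, bp_interval - 1)

def must_log(send_interval, sender_bp_interval):
    """True if a one-step-back replay of the sender cannot re-send the message."""
    return send_interval < restart_interval(sender_bp_interval)

def waiting_time_bound(T):
    """Replay spans at most two checkpoint intervals."""
    return 2 * T

print(must_log(0, 3))           # sent two intervals back: must be logged
print(must_log(2, 3))           # regenerated during the one-step replay
print(waiting_time_bound(60))   # T = 60 s  ->  bound of 120 s
```

Only the messages crossing the one-step-back boundary are logged, which is why the measured logging overhead stays small.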

To debug large-scale parallel programs, Process Isolation [KrVo 02] is an efficient method, in which a subset of the original set of processes can be investigated. The remaining processes are ignored by eliminating them from the debugging process. The number of processes can be reduced to an arbitrary subset of the original processes, while the "surroundings" of these processes are simulated by the debugging system. To minimize the waiting time for the isolated processes, ROS can be used. In this case the necessary information can be obtained from the trace data, so that a decision about message logging is more accurate than one obtained on the fly. Consequently, a new method named MRT is presented; it may be better than ROS in this respect, because MRT ensures that the upper bound of the replay time is 2T. The probe effect may cause serious problems in debugging, and thus the four-phase-replay method is proposed to avoid the probe effect when using MRT in Process Isolation.
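The core of Process Isolation can be sketched as follows (an illustrative, invented data layout, not the thesis implementation): only events of the isolated group are replayed, and receives whose sender lies outside the group are served from the trace/log, so the "surroundings" are simulated without running the other processes:

```python
def replay_isolated(group, events, message_log):
    """Replay the event list for the isolated group only."""
    received = []
    for ev in events:
        if ev["proc"] not in group:
            continue                                    # outside: not replayed
        if ev["kind"] == "recv":
            if ev["src"] in group:
                received.append(ev["payload"])          # regenerated in-group
            else:
                received.append(message_log[ev["id"]])  # simulated surroundings
    return received

# Events of a 4-process run; we isolate the group {0, 1}.
events = [
    {"proc": 0, "kind": "recv", "src": 1, "payload": "m1", "id": "a"},
    {"proc": 2, "kind": "recv", "src": 0, "payload": "m2", "id": "b"},  # outside
    {"proc": 0, "kind": "recv", "src": 3, "payload": None, "id": "c"},  # boundary
]
log = {"c": "m3-from-log"}
print(replay_isolated({0, 1}, events, log))
```

Messages crossing the group boundary are exactly the ones that must be available from the trace, which is why the logging decision can be made offline and more accurately than on the fly.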

Finding a suitable recovery line with a low overhead is an interesting task. Many methods have been offered, but they cannot be used here, because the rollback/replay distance may be undesirably long in some cases. Therefore a new method, which helps to construct a suitable recovery line, is shown. First, an R2-graph is constructed; it includes checkpoint events and information about transferred messages which are not logged. The relation between the checkpoint events in the R2-graph is called an R2-path or DR2-path (Direct R2-path). Then a suitable recovery line can be obtained by a simple method based on the R2-graph and DR2-path (and R2-path).
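The exact R2-path construction is given in Chapter 8; as a rough illustration of the graph-based idea only, the sketch below models checkpoints as nodes and dependencies via unlogged messages as directed edges, and rejects a candidate recovery line when one member can reach another (such a path between line members would force further rollback). Everything here, including the validity test itself, is a simplified stand-in:

```python
from collections import deque

def reachable(graph, src, dst):
    """Plain BFS reachability over the checkpoint graph."""
    seen, frontier = {src}, deque([src])
    while frontier:
        node = frontier.popleft()
        if node == dst:
            return True
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False

def is_valid_line(graph, line):
    """No member of the candidate line may reach another member."""
    return not any(reachable(graph, a, b)
                   for a in line for b in line if a != b)

# Checkpoints named (process, index); edge = dependency via an unlogged message.
g = {("P0", 2): [("P1", 1)]}
print(is_valid_line(g, [("P0", 2), ("P1", 1)]))  # path between members: rejected
print(is_valid_line(g, [("P0", 1), ("P1", 1)]))  # no path: accepted
```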

All the above techniques and methods are implemented and verified. They are integrated into the NOPE (NOndeterministic Program Evaluator) library [KrVo 99a], which had been developed previously at GUP Linz, Johannes Kepler University Linz. In addition, we have developed a checkpointing library named OMpiClib, because no existing checkpointing library satisfies our requirements. OMpiClib has been developed from Libckpt, a general-purpose checkpointing library for Unix-based systems developed at the University of Tennessee. Libckpt is a checkpointing library for a single process, while OMpiClib is a checkpointing library for MPI programs. OMpiClib is a user-level checkpointing library with some advantages, such as:

- The sizes of the image files are limited, because a lot of data from the MPI library is not stored.

- Processes can take checkpoints autonomously.

- The recovery process is fast and does not require coordination.

Organization

The rest of the thesis is organized as follows:

Chapter 2 introduces the objective of our approach: the demand for applications running on HPC platforms and the difficulty of debugging large-scale, long-running parallel programs. Problems in using cyclic debugging for parallel programs, e.g. nondeterminism, irreproducibility, the probe effect, etc., are presented. Solutions to these problems are also discussed.

Chapter 3 introduces the event graph model that is used to model a program's execution. The events of interest and the information related to them are described. The event graph model is used in later chapters.

Chapter 4 presents the state of the art as regards checkpointing techniques. The application of checkpointing in many fields is discussed. Analysis of the different requirements of checkpointing in debugging and in other areas reveals the shortcomings of using current checkpointing techniques to reduce the waiting time during debugging of long-running parallel programs.

Chapter 5 describes OMpiClib and introduces other checkpointing libraries for MPI programs. OMpiClib is a checkpoint library which supports checkpointing MPI programs and is used in debugging. Currently, it is developed to run on the SGI Origin 3800.

Chapter 6 presents Shortcut Replay. First, the limitations of current methods are shown. Then the record&replay phases in Shortcut Replay are introduced. In addition, the method used to detect orphan messages is also presented.

Chapter 7 presents two methods, ROS and EROS, designed to minimize the waiting time with a low overhead. The problems of message logging in debugging and a discussion of available solutions are presented first. Then the details of ROS and EROS are introduced.

Chapter 8 introduces the R2-graph and DR2-path (and R2-path). A method built around these, which helps to find a suitable recovery line for use in Shortcut Replay, is presented.

Chapter 9 discusses the problems of debugging large-scale, long-running parallel programs. Then Process Isolation is introduced. In order to minimize the waiting time in Process Isolation, ROS can be used, but it is improved to boost efficiency. In addition, a better solution called MRT and the four-phase-replay method are also introduced.

Chapter 10 provides directions for future work. Finally, Chapter 11 presents conclusions.


Chapter 2 Parallel Applications and Debugging Parallel Programs

The problem domain relates to parallel computing, and thus this chapter describes the need for high-performance computers, especially in science and engineering disciplines. Afterwards we discuss testing and debugging in software development.

We are interested in cyclic debugging with breakpointing. The problems in applying this technique to parallel programs, such as nondeterminism, irreproducibility, the completeness problem, and the probe effect, are discussed. In addition, we also survey record&replay methods used to solve the nondeterminism problem. The disadvantages of current record&replay methods when applied to long-running, large-scale parallel programs are explained.


Computer science is applied in many areas, especially in computing. It helps to solve many problems, but large-scale problems involving large-scale calculations are still challenges, due to current limitations of computing power. IBM's Blue Gene project, which builds a massively parallel computer used in biomolecular simulations [Alle 01], is clear proof of the necessity of powerful computation. There is a continual demand for greater computational speed in computer systems than is currently possible [WiAl 99]. Such problems present so-called "Grand Challenges" [Gran 89], which are defined as follows:

Definition 2-1: Grand Challenge Problems [WiAl 99]

A grand challenge problem is one that cannot be solved in a reasonable amount of time with today’s computers.

Grand Challenge applications are fundamental problems in science and engineering with broad economic and scientific impact [Gran 96]. These fundamental problems generate increasingly complex data, require realistic simulations of the process under study, and demand intricate visualizations of the results [Gran 99]. They are generally considered intractable without the use of state-of-the-art massively parallel computers, due to the large-scale calculations involved. They include such problems as predicting the weather, analyzing fuel combustion, ocean modeling, the rational design of drugs, and determining the origin of structure in the universe [Gran 95]. Tackling many of these problems requires computer simulation [Koni 00], which has joined observation and theory as one of the fundamental paradigms of modern science.

2.1.2 Demand for Parallel Processing

Almost all large-scale problems relate to simulations, where mathematical models of physical phenomena are translated into computer software that specifies how calculations are performed, using input data that may include both experimental data and estimated values of unknown parameters in the mathematical models. By repeatedly running the software, using different data and different parameter values, an understanding of the phenomenon of interest emerges. The realism of these simulations and the speed with which they are produced affect the accuracy of this understanding and its usefulness in predicting change.

As mentioned above, large-scale problems often require great computational speed to give valid results, due to huge repetitive calculations on large amounts of data. Furthermore, some applications must complete within a “reasonable” time period [WiAl 99]. For example, a weather forecasting program that requires two days to compute the next day’s weather is certainly useless. Because of the many important large-scale problems, we need ever more powerful computers which can solve a problem in a short time.

Based on current technology, the power of a computer depends on the processor speed, the amount of memory, the I/O speed, etc. Obviously, processor speed is increasing fast, but not fast enough to satisfy some users’ requirements. Unfortunately, processor speed is limited by the speed of electronic signals [Tane 95]. In addition, the uniprocessor computer model does not support a large memory in a single system.

To deal with the challenge of how to solve large-scale problems with current computers, parallel computers have been introduced. This computing platform is a specially-designed computer system containing multiple processors or several independent computers interconnected in some way [WiAl 99]. A parallel computer achieves a higher computational speed than a uniprocessor computer. In theory, the speed is increased n times if n processors are used and thus the problem is completed in 1/nth of the time. Of course this is difficult to achieve in practice. One reason is that the problem is rarely divided perfectly into independent parts, and interaction is necessary between the parts. However, substantial improvements may be achieved, depending upon the problem and the amount of parallelism in the problem. Therefore, parallel computing is sometimes called the “key enabling technology” of modern computers, which makes it possible to fulfil the ever-increasing demand for higher and higher performance at lower production costs, and thus allows sustained productivity in real-life applications [Hwan 93].
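The gap between the ideal n-fold speedup and what the imperfectly divisible problem allows can be sketched with Amdahl's law (the law is not named in the text above; the serial fractions used below are purely illustrative):

```python
def speedup(n, serial_fraction=0.0):
    """Speedup on n processors when serial_fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

# A perfectly divisible problem reaches the theoretical n-fold speedup.
assert speedup(8) == 8.0

# With only 10% inherently serial work, 8 processors yield well under 8x.
print(round(speedup(8, 0.1), 2))  # → 4.71
```

Even a small non-parallelizable fraction dominates as n grows, which is one reason the theoretical 1/nth completion time is rarely observed in practice.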

Both the overall computational speed of processors in parallel computers and the amount of memory are increasing. Since each processor can access its local memory, the total amount of memory can be scaled up. Please note that the amount of memory also affects the computational speed, because the data to be processed have to be transported from memory to CPU (Central Processing Unit) and back. Higher accuracy of the results is also achieved as a consequence. Due to limitations such as speed and memory in uniprocessor computers, many programs could not be completed with sufficient accuracy and in a short enough time to be of interest. Of course, the higher the calculation speed and the larger the amount of memory available in parallel computers, the higher the degree of accuracy that can be achieved.

2.1.3 Parallel Computer Architecture

Parallel computers can be classified into several kinds on the basis of their hardware architecture. There is no standard for this classification; however, only a few classification schemes are widely used. First, we have to deal with one very well-known classification, proposed by Flynn [Flyn 72] and commonly called the Flynn classification. Based on the number of instruction streams and the number of data streams, parallel computers are divided into four basic classes:

- SISD: Single Instruction stream, Single Data stream

- SIMD: Single Instruction stream, Multiple Data streams

- MISD: Multiple Instruction streams, Single Data stream

- MIMD: Multiple Instruction streams, Multiple Data streams

Figure 2-1 Shared-Memory Multiprocessors



SISD covers traditional sequential computers, in which there is only one CPU; hence they are restricted to one instruction stream that is executed serially. SIMD can be found in array computers and vector processors. The computer has many processing units and all of them can execute the same instruction on different data in lock-step. The type of computer in which multiple instructions act on a single stream of data is MISD. Unfortunately, it seems difficult and often useless to build a computer with such an architecture [HwXu 98]. Computers which use the MIMD architecture can execute several instruction streams in parallel on different data. A main task can be divided into several subtasks and these subtasks can be executed in parallel on a MIMD computer in order to shorten the run time. Basically, there is a large variety of MIMD parallel computers and thus the Flynn taxonomy proves not fully adequate for the classification of parallel computers, especially within MIMD [StDo 02].

Parallel computer architectures can also be classified based on the ways that processors access memory. In this classification, three main types of architectures can be distinguished:

- Shared-Memory Multiprocessor (SMM)

- Distributed-Memory Multicomputer (DMM)

- Distributed Shared-Memory (DSM)

The key technique used in SMM is that there is a global memory area shared between the processors, which is thus called the shared memory. The connection between processors and the shared memory is through some form of interconnection network (see Figure 2-1). The system employs a single address space, which means that each location in the whole main memory system has a unique address and this address is used by each processor to access the location. The main advantage of the SMM model is that it is easy to develop programs. Since data are accessible to all processors and programmers need not manage explicit data exchange, programming a shared-memory program is convenient. However, high levels of parallelism are difficult to obtain as the number of processors increases, so usually not more than 64 processors are involved.

Figure 2-2 Distributed-Memory Multicomputers

In DMM, each processor has its own local memory, which cannot be accessed by the other processors. There is no single address space in this model. Communication and synchronization between the processors are done by exchanging messages over the interconnection network, as shown in Figure 2-2.

In comparison with SMM, DMM is easy to scale. Using this architecture, massively parallel processors (MPPs) can be built, with up to several hundred or even thousands of processors. Computer clusters are also one kind of DMM, built from self-contained computers that could operate independently. The advantage of clusters is low cost. Clusters have a much better cost/performance ratio than MPPs, but they are still unable to reach the peak performance of MPPs.


In order to combine the advantages of both SMM and DMM, the third model, DSM, is proposed. A DSM system is just like a DMM system, with the enhancement that each processor has access to the whole of memory using a single memory address space (see Figure 2-3(a)). A processor certainly can access its local memory, while message passing is needed to get data not in its local memory. This is done automatically. Sometimes the phrase “shared virtual memory” is used in this case to give the illusion of shared memory even when it is distributed. To reduce the time taken to access data from other processors, each processor uses a cache, which also keeps the number of memory access conflicts and the extent of network contention low (see Figure 2-3(b)).

In an effort to classify parallel computers today, Kai Hwang and Zhiwei Xu have proposed another approach [HwXu 98], which includes five categories:

- PVP: Parallel Vector Processors

- SMP: Symmetric Multiprocessors

- MPP: Massively Parallel Processors

- DSM: Distributed Shared-Memory Machines

- COW: Clusters of Workstations

Figure 2-3 Distributed Shared-Memory System

A PVP system contains a small number of powerful custom-designed vector processors, each capable of at least 1 Gflop/s performance. Examples of PVPs are the Cray C-90, the Cray T-90, and the NEC SX-4. The disadvantages of PVPs are that the number of processors is small and the price is high.

The hardware technique used in SMPs is different from that in PVPs. SMPs use commodity microprocessors with on-chip and off-chip caches and access a shared memory through a high-speed bus. Besides the bus, a crossbar switch can be used. The word “symmetric” refers to the fact that every processor has equal access to the shared memory, the I/O devices, and the operating system services. The disadvantage of SMPs is the limitation in scale due to the usage of shared memory and a shared bus. Examples of SMPs include the IBM R50, the SGI Power Challenge and the DEC AlphaServer 8400.

In order to build powerful parallel computers which include a large number of processors, the distributed-memory model is used in MPPs, DSMs and COWs. MPPs can include hundreds or even thousands of processors; the term generally refers to very large-scale computer systems. They use commodity microprocessors, which are connected through an interconnect with high communication bandwidth and low latency.


Kai Hwang and Zhiwei Xu’s DSM is the same as the above DSM model. The memory is distributed physically, but a single address space is provided at the high level. Examples of this kind of machine are the Stanford DASH architecture and the Cray T3D.

Figure 2-4 Overlapped design space of clusters, MPPs, SMPs and distributed computer systems [HwXu 98]

Clusters such as Digital’s TruCluster, the IBM SP2, and Berkeley NOW are COWs. Basically, COWs are the low-cost variation of MPPs, where each node is a “headless” workstation (a complete workstation, probably minus some peripherals, e.g. monitor, keyboard, mouse, etc.) and the nodes are connected through a low-cost commodity network (e.g. Ethernet, FDDI, Fiber-Channel, ATM switch, or Myrinet). While the network interface in MPPs is tightly coupled, the network interface in COWs is loosely coupled to the I/O bus in a node. Another difference between MPPs and COWs is that a local disk exists at each node in COWs but not in MPPs. In addition, each node runs a complete operating system in COWs, whereas it is only a microkernel in MPPs.

However, the boundaries between these classes of parallel computers are fuzzy. Sometimes it is difficult to assign an architecture to only one of the above classes. For example, a cluster may use a proprietary high-performance switch as the communication network, as in the IBM SP2; or each node in a COW may itself be an SMP. Therefore Kai Hwang and Zhiwei Xu showed the overlaps between the four architectures SMPs, MPPs, DSMs and COWs as shown in Figure 2-4.

2.1.4 High Performance Computing

Due to the demand for computing power, many powerful computers, so-called supercomputers, are produced. Supercomputers are regarded as computers with a performance an order of magnitude higher than that of ordinary computers [Oyan 02]. Of course, supercomputers need not be parallel computers. However, most supercomputers today are parallel computers. They have to employ parallelism in one way or another to achieve higher performance. Supercomputers feature all the architectures described in Section 2.1.3.

The TOP 500 supercomputer list is announced twice a year at the Mannheim Supercomputer conference; since 1993 it has also been announced in November at the Supercomputer conference series. Supercomputers are ranked based on the LINPACK benchmark, which allows the user to scale the size of the problem and to optimize the software in order to achieve the best performance for a given machine [Linpack]. The LINPACK benchmark was created and is maintained by Dongarra at the University of Tennessee [Dong 94]. Information on the TOP 500 supercomputer list can be found at http://www.top500.org.

Currently, the term “supercomputer” is being replaced by the name “high-performance computer”. Please note that supercomputers are high-performance computers, but the expression “high-performance computer” emphasizes their continuity with ordinary computers [Oyan 02]. In the past, we distinguished supercomputers from PCs (Personal Computers) by their power. Nowadays, PCs can be connected to build powerful computers. Therefore high-performance computers have become part of a continuum of computing power starting with PCs.

High-Performance Computing (HPC) is an emerging discipline concerned with solving large-scale problems using high-performance computers [PoUt 97]. Since grand challenge applications are fundamental problems in science and engineering with broad economic and scientific impact, it is necessary to enhance HPC technologies in order to solve these applications. In HPC, it is necessary to develop not only hardware but also software. It is easy to see the improvement in the hardware area. On the one hand, factors which determine the power of a computer, such as processor speed, the time to exchange data between processors, the time to access data in memory, I/O speed, etc., are improving more and more. On the other hand, new technologies which make it possible to integrate many processors in a computer, or to connect many computers in order to achieve better performance, are also being developed.

There are still problems in the software area. The architecture of high-performance computers varies widely, from parallel and distributed architectures to networks of workstations. To utilize the potential performance of such high-performance computers, a range of new and enhanced software tools is needed. Tools used in this area include [ApBe 96]:

- Compilers: They generate the executable program from the source program.

- Program Restructurers and Parallelizers: These tools convert sequential, or partially parallel, programs into efficient parallel programs. The inputs and outputs are source code. There are several different classes of parallelizers aimed at instruction-level parallelism, task-level parallelism, etc.

- Program Specification and Construction Tools: These tools help to construct parallel programs, usually by composing sequential code fragments.

- Static Analyzers: They are used to detect both potential bugs (such as race conditions) and poor resource utilization (e.g. of processor, memory, or cache).

- Parallel Debuggers: These tools allow programmers to control and monitor the execution of individual tasks.

- Execution and Performance Analyzers: They are used to examine what happened during the execution of the program and to determine resource bottlenecks. Some of these tools can suggest a solution.

- Libraries: They help to reduce development effort.

Unfortunately, software technology and tools have lagged behind hardware technology. Discussing this problem, Appelbe and Bergmark [ApBe 96] remark that “there is a serious lack of effective software tools, which leads to wasted computer resources and inhibits the use of high-performance parallel computers by scientists”. They also give the following reasons for the lack of efficient software tools:

- Hardware vendors supply very limited tools

- The majority of available tools are not production quality

- There is too much diversity, and there are too few standards, in high-performance computing

- Users are uninterested in using the available tools

- The available tools have poor user interfaces, or do not give users what they want

There are still many problems in software technology for high-performance computing. In order to construct effective software tools, the first important step is to develop suitable techniques to deal with the scalability problem. There is no guarantee that, if a method works perfectly for a small problem, it can be applied to solve a large-scale problem. The same problem arises with algorithms.

A lot of current methods and algorithms were developed for sequential programs and used to solve small-scale problems in the past. Some of them either cannot be used for parallel large-scale programs or do not give good performance. This is an obstacle to using high-performance computers. As a contribution in this area, we will discuss and propose a solution to problems in debugging long-running, large-scale parallel programs.

Program testing, debugging, and performance analysis are run-of-the-mill tasks for any programmer or engineer who is involved in the development of software products [Plat 84]. Evidently, a program is valid if it runs correctly. A reliable programming systems product provides the assurance that the program will perform satisfactorily in terms of its functional and nonfunctional specifications within the expected deployment environments. In a typical commercial development organization, the cost of providing this assurance via appropriate debugging, testing and verification activities can easily range from 50 to 70 percent of the total development cost [HaSa 02]. Consequently, finding a solution to reduce the costs of this process is necessary and important.

2.2.1 Testing

Testing is used to examine the input/output behavior of a program corresponding to its specification [Kran 00a]. In a software life-cycle the testing phase is usually initiated whenever a module or a complete program has been implemented and its correct behavior must be reviewed. For that reason testing is closely related to the two activities of verification and validation. While the former is used to demonstrate the target’s correctness according to its behavioral specifications, the latter is applied to check the target’s correct execution for a limited set of inputs.

Verification, as defined by IEEE/ANSI [IEEE 83], is the process of evaluating a system or component to determine whether the products of a given development phase satisfy the conditions imposed at the start of that phase. Let P be a program written in language L which is expected to satisfy a set of specifications Φ = {Φ1, Φ2, …, Φn}. Verification shows that P satisfies Φ. Thus, verification is the process of proving or demonstrating that the program correctly satisfies the specifications [HaSa 02]. In this case, the term verification is used in the sense of “functional correctness”.

In contrast to verification, the process of validation is applied to check whether the system or one particular module corresponds to the given requirements. As defined by IEEE/ANSI [IEEE 83], validation is the process of evaluating a system or component during or at the end of the development process to determine whether it satisfies the specified requirements. It normally involves the execution of actual software on a computer and usually exposes defects. Two basic approaches can be distinguished: the validation of the complete system (according to the user's requirements) and the validation of the functional requirements.

Whereas verification proves conformance with a specification, testing finds cases where a program does not meet its specification [HaSa 02]. This means that, given a specification Φ and a program P, testing is the procedure to find as many Φ1, Φ2, …, Φp ∈ Φ as possible that are not satisfied by P. Any activity which exposes program behavior violating a specification can be called testing. Testing can be classified into “static testing” and “dynamic testing”. Static testing includes activities such as design review, code inspection and static analysis of source code; the source code is not executed in the process of finding the error or unexpected behavior. In contrast to this, dynamic testing requires execution of the code. Dynamic testing can be divided into two categories: black-box and white-box testing [Paul 99].

During black-box testing, the program is only verified according to its specification. Often, such tests are applied when the input space and output space of a program are finite. In that case, a set of black-box tests can be constructed that can show conclusively that the program is correct and its behavior matches its specification [Boeh 76]. In contrast to this, white-box testing opens up the “box” and looks at the specific logic of the program to verify how it works [Chu 97]. Tests use logic specifications to generate variations of processing and to predict the resulting outputs.

The purpose of testing is to detect all errors in the program. However, a principal problem of testing is that, in the general case and for a program of arbitrary size and complexity, it can only ever demonstrate the existence of errors, but can never prove the absence of errors [Paul 99]. Furthermore, exhaustive testing is usually impossible due to the number of possible inputs of a program in most software systems and a huge or infinite number of execution scenarios to be performed [PeYa 96]. Thus an important step in the testing process is the development of a carefully selected test suite. Evidently, this does not give a complete solution. Therefore a very common approach is to fix every detected bug and perform testing until the project management orders a product release. Bach tries to articulate and perhaps quantify the amount of testing [Bach 98] that is sufficient for different projects depending on the circumstances; this may be useful for everyday projects as well as mission- or life-critical projects. This leads to the idea of “good enough testing”, that is, the process of developing a sufficient assessment of quality, at a reasonable cost, to enable wise and timely decisions to be made concerning the product [Bach 98].

2.2.2 Debugging

Despite efforts to develop formal specifications and program verification techniques, nearly all programs that have ever been written exhibit some erroneous runtime behavior [LeBl 89]. Thus one of the important parts of testing is debugging, i.e., error location and correction. It is an essential part of the software life-cycle, because a program is obviously useful only if it executes correctly and is sufficiently reliable [WiAl 99].

Definition 2-2: Debugging [IEEE 83]

Debugging is the process of locating, analyzing and correcting suspected faults, where fault is defined as an accidental condition that causes a program to fail to perform its required function

In other words, the process of debugging involves analyzing and possibly extending (with debugging statements) the given program that does not meet the specifications, in order to find a new program that is close to the original and does satisfy the specifications [HaSa 02]. This means that, given a specification Φ and a program P1 not satisfying some Φk ∈ Φ, the task is to find a program P2 “close” to P1 that does satisfy Φk.

In software development debugging is usually used at three successive stages. The first is during the coding process, in which the programmer translates the design into executable code. All errors in this process must be quickly detected and fixed before the code goes to the next stage of development. Here programmers often use unit testing to expose any defects at the module or component level. The second place for debugging is during the later stages of testing, involving multiple components or a complete system, when unexpected behavior such as wrong return codes or abnormal program termination may be found. The third place for debugging is in production or deployment, when the software under test faces real operational conditions.

Although testing and debugging are widely accepted as an important step in the software engineering domain, much of the computer science community has largely ignored the debugging problem [Lieb 97]. Furthermore, few books thoroughly discuss software debugging at a practical level [Stit 92]. Of course, there are books and papers describing the features of one or more debugging tools from a user’s point of view, but these do not provide a general survey of the problem of curing program bugs. From the scientific point of view, debugging is still studied very little, compared, for example, to compilers [Rose 96]. In addition, debugging is still absent from the current computer-science curriculum, which leaves students and programmers without much guidance.

If the program is compiled into binary form, the activity of debugging is usually aided by a program called a debugger, which offers the programmer a framework in which he can examine the inside of the running program. There are two possible definitions:

Definition 2-3: Debugging tool [Stit 92]

A debugging tool is anything that provides useful knowledge about a program’s execution and the program state changes that occur. Different forms of evidence are provided by software debuggers, source language instrumentation, instrumentation drivers, interrupt interception drivers, and hardware monitoring aids.

Definition 2-4: Debugger [Rose 96]

A debugger is a tool to help track down, isolate, and remove bugs from software programs. In fact debuggers are tools to illuminate the dynamic nature of a program in order to understand it, as well as to find and fix defects.

In other words, a debugger is a tool that controls the application being debugged so as to allow the programmer to follow the flow of program execution and, at any desired point, to stop the program and inspect the state of the program to verify its correctness [Rose 96]. In addition, Rosenberg also compares debuggers to other tools with which a program can be examined, like the magnifying glass, the microscope, the logic analyzer, the profiler, and the browser. In principle, a debugger is something like a sophisticated software analyzer [Kran 00a]. Debuggers help the programmer to reduce the debugging time. As software gets continuously more complex, debuggers become more and more important in tracking down bugs. The basic capabilities of a debugger are [Kran 00a]:

• To follow the executable program step by step (tracing)

• To stop and continue the program’s execution at breakpoints (breakpointing)

• To display and manipulate the contents of variables, registers, and memory locations (inspection)

In [Rose 96], Rosenberg also describes basic principles of designing and developing a debugger:

• Heisenberg principle: the debugger must intrude on the debuggee in a minimal way. It is important to any sort of in-process testing or monitoring that the test procedure does not unduly affect the normal operation of the system being tested. The act of debugging an application should not change the behavior of the application. If this is not the case, the usefulness of the debugger is in question.

• Truthful debugging: at all costs, the debugger must be truthful, so that the programmer can always trust it. The debugger must never mislead the programmer, because the programmer is frequently testing theories of how the observed failure may be caused. Any misinformation may send the programmer off in the wrong direction.

• Program context information: context is the torch in a dark cave. The debugger’s most important role is the presentation of context information, so that the user always knows where he is and how he got there in the debuggee.

In addition, Rosenberg addresses the problem that new system developments occur long before any correspondingly powerful debugging support for them is available.

2.2.3 Breakpoint

Breakpoint setting is one of the fundamental mechanisms for debugging programs [DoLi 00]. This work is based on breakpoints, which can be defined as follows:

Trang 28

Definition 2-5: Breakpoint and breakpointing [Stit 92]

A breakpoint is a controlled way to force a program to stop its execution. Breakpointing may occur on software interrupt calls, on calls to a program subfunction, or at selected points within the program.

Figure 2-5 Classification of breakpoints [Kacs 00]

Breakpoint setting allows the programmer to halt and examine a program at interesting points in its execution. More precisely, it gives a debugger the ability to suspend the debuggee when its thread of control reaches a particular point. The program’s stack and data values can then be examined, data values possibly modified, and program execution continued until the program encounters another breakpoint location, faults, or terminates.

In principle a breakpoint is a special instruction (trap instruction) inserted at the desired location in the program by the debugger, replacing the instruction (or instructions) there. When the trap instruction is executed, it causes a fault that is detected by the operating system. The operating system then informs the debugger that a fault has occurred. The debugger then inspects the type of fault and reports it to the programmer.
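The trap-based mechanism just described can be sketched with a toy instruction list standing in for machine code (the instruction semantics and names below are invented; a real debugger patches binary opcodes and is notified of the fault by the operating system):

```python
TRAP = "TRAP"

def set_breakpoint(program, addr):
    """Debugger side: replace the instruction at addr with a trap, saving the original."""
    saved = program[addr]
    program[addr] = TRAP
    return saved

def run(program, state):
    """Execute until a trap fires; return the trapped address, or None on normal exit."""
    while state["pc"] < len(program):
        instr = program[state["pc"]]
        if instr == TRAP:
            return state["pc"]   # the "fault": control passes back to the debugger
        state["acc"] += instr    # toy semantics: every instruction adds its operand
        state["pc"] += 1
    return None

program = [1, 2, 3, 4]
state = {"pc": 0, "acc": 0}
saved = set_breakpoint(program, 2)

hit = run(program, state)        # stops before instruction 2 executes
print(hit, state["acc"])         # → 2 3

# The debugger inspects the state, restores the original instruction, and resumes.
program[hit] = saved
run(program, state)
print(state["acc"])              # → 10
```

Restoring the saved instruction before resuming mirrors what a real debugger does so that the patched location executes its original code after the stop.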

Different debuggers support breakpoint setting in different ways [Paxs 90], due to the different kinds of breakpoint. There are two common types of breakpoints: conditional breakpoints and data breakpoints (watchpoints) [Paxs 90]. Conditional breakpoints are triggered when reached only if a particular condition is true, while data breakpoints are triggered when a memory location is read or written. A simple type of conditional breakpoint is one that interrupts execution after the breakpoint has been reached n times. An example of a data breakpoint is one where execution is stopped if variable a is modified. Another kind of conditional breakpoint is a control breakpoint [WaGr 93], which specifies the breakpoint condition in terms of the program’s control flow; for example, the execution is stopped when it calls function main. In addition, breakpoints sometimes require support from hardware [Paxs 90]. For instance, hardware page protection can be used to implement watchpoints by making pages read-only (to catch modifications) or inaccessible (to catch any form of access).
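Both breakpoint types above can be sketched in a few lines; the hit-count predicate and the watched attribute are illustrative stand-ins for what a real debugger implements with trap instructions and page protection:

```python
def hit_count_breakpoint(n):
    """Conditional breakpoint: fire only the n-th time the location is reached."""
    hits = {"count": 0}
    def should_stop():
        hits["count"] += 1
        return hits["count"] == n
    return should_stop

stop = hit_count_breakpoint(3)
fired_on = [visit for visit in range(1, 6) if stop()]
print(fired_on)  # → [3]

class Watched:
    """Data breakpoint (watchpoint) on attribute 'a': log every write to it."""
    def __init__(self):
        self.writes = []
        self._a = 0
    @property
    def a(self):
        return self._a
    @a.setter
    def a(self, value):
        self.writes.append((self._a, value))  # the "debugger" sees old and new value
        self._a = value

obj = Watched()
obj.a = 5
obj.a = 7
print(obj.writes)  # → [(0, 5), (5, 7)]
```

A control breakpoint would replace the hit-count predicate with a check on the current call, e.g. stopping when a particular function is entered.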

Setting breakpoints in sequential programs is well understood; however, there are still many problems in the case of parallel programs. In sequential programs there is only one thread of computation, and thus the execution of this thread should be stopped when the breakpoint is hit. In parallel programs, several threads or processes exist concurrently and thus different effects may occur when a breakpoint is hit. According to how many processes are stopped when a breakpoint is hit, Kacsuk classifies several kinds of breakpoints [Kacs 00]. They are presented in Figure 2-5.

A local breakpoint affects only the owner process, which means that only the owner process of the breakpoint hit is stopped. A message breakpoint has an effect on the processes that participate in the same communication event [Wism 97]. There are two kinds of global breakpoints: single global breakpoints and global breakpoint sets. Single global breakpoints are divided into complete global breakpoints and partial global breakpoints. All processes will be stopped if a complete global breakpoint is used. Of course the owner process is stopped at the breakpoint, while other processes are stopped at non-defined points of their execution, depending on the relative speed of the processes and the debugging system. In contrast to complete global breakpoints, only restricted processes are stopped in the case of partial global breakpoints. The set of these processes is called the scope of the partial global breakpoint. Other processes (outside the scope) can run without any influence from the breakpoint. The distributed breakpoint used in this thesis can be compared with a (strongly complete) global breakpoint set. The distributed breakpoint is a set of breakpoints (one on each process) and each process will be stopped at its breakpoint.

During debugging, programmers sometimes want to see the program in a state that reflects all events that have had a causal effect on the state of the program at the breakpoint. With a sequential program this is straightforward, since only events earlier in time have a causal effect on events later in time, and sequential execution totally orders all events. In parallel programs, it is more complex, since events are only partially ordered. Fortunately, the conventional notion of a breakpoint in a sequential program can be retained in parallel programs through causal distributed breakpoints [FoZw 90]. A causal distributed breakpoint restores each process to the earliest state that reflects all events that happened before the breakpoint was reached, according to Lamport’s partial order of events in a distributed system [Lamp 78]. In terms of a breakpoint in the breakpoint process, the procedure is as follows:

1. The breakpoint process is stopped at the well-defined breakpoint, and

2. the other processes are stopped at the earliest states that reflect all events in those processes that happened before the breakpoint event.

The causal distributed breakpoint is useful in debugging and is usually computed by means of a dependency vector [FoZw 90] or vector time [Bast 94].
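A causal distributed breakpoint can be read off directly from vector timestamps. The sketch below is a simplified illustration (the class and helper names are invented, and only the standard vector-time update rules are assumed): each process maintains a vector clock, and the vector time of the breakpoint event tells every other process how many of its events must have been executed at the stopping point.

```python
class Process:
    """Maintains a vector clock for one process in an n-process program."""
    def __init__(self, pid, n):
        self.pid = pid
        self.vt = [0] * n

    def internal(self):                     # local event: tick own component
        self.vt[self.pid] += 1
        return list(self.vt)

    def send(self):                         # send event: tick, attach timestamp
        self.vt[self.pid] += 1
        return list(self.vt)

    def receive(self, msg_vt):              # receive: component-wise max, then tick
        self.vt = [max(a, b) for a, b in zip(self.vt, msg_vt)]
        self.vt[self.pid] += 1
        return list(self.vt)

def causal_breakpoint(bp_vt):
    """Process q must stop right after its bp_vt[q]-th event: the earliest
    state reflecting everything that happened before the breakpoint."""
    return {q: k for q, k in enumerate(bp_vt)}

p0, p1 = Process(0, 2), Process(1, 2)
m = p0.send()                  # p0's send gets timestamp [1, 0]
p1.receive(m)                  # p1's receive gets timestamp [1, 1]
bp = p1.internal()             # breakpoint event on p1: [1, 2]
stops = causal_breakpoint(bp)  # p0 stops after event 1, p1 after event 2
```

Here the breakpoint’s vector time [1, 2] states that the send on p0 happened before the breakpoint, so p0 is rolled forward exactly to the state after that send.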

Cyclic debugging is a traditional debugging method, in which the program is executed repeatedly, enabling a programmer to collect more information and narrow down his/her search for the suspected error. It usually requires some additional techniques such as memory dumps, tracing and breakpoints [TsYa 95].

In a memory dump, the program status, including program object code, register contents and memory contents, is dumped into a special memory area whenever the system is terminated abnormally or by the programmer’s request. It provides sufficient information to find the error. Unfortunately, programmers must have a strong background in machine language to examine the dumped code.

In another solution, special tracing facilities are embedded in the compiler or the operating system to track and display every step of a normal execution. This allows programmers to see control flow, data flow, variable contents, and function calling sequences. Furthermore, this technique gives programmers a sense of the step-by-step flow of program execution.


(Flowchart: begin → instrument program → set breakpoint → execute program → inspect state → bug detected? if no, repeat the cycle; if yes, correct error → end)

Figure 2-6 Breakpoint in cyclic debugging [Kran 00a]

As mentioned in Section 2.2.3, breakpointing allows programmers to halt and examine a program at points of interest in its execution. It is usually used in cyclic debugging. Kranzlmüller describes such a use of breakpointing in [Kran 00a], which is presented in Figure 2-6. It starts by instrumenting the target program: debugging code is added to the source code in order to observe the program’s execution. Afterwards the program is analyzed iteratively to detect the lines of faulty code. Although it depends on the tool applied, there are usually three activities taking place:

• Setting a breakpoint

• Executing the program as far as the breakpoint

• Analyzing the process state at the breakpoint

Obviously, if an error is detected at an observation, the reason for this incorrect behavior must lie at the observation point or somewhere before it. Then the debugging process is started, in order to deduce the conditions under which the program produced the incorrect output. Due to its characteristics, this process resembles a “backward” process or “flowback” analysis in practice [Balz 69]. There are two ways to do this:

• Reverse execution


• Backtracking with re-execution

Following up an idea of Choi, Miller and Netzer in [ChNe 91], the best solution would be to execute the program backwards one step at a time, starting at the observation point and going back to the point of the bug’s origin. By comparing the states of the program at each of these steps, the error would be localized between the last point where the state of the program was what was expected and the first point where it was not [Paul 99]. Unfortunately, this is not feasible in reality: it is practically difficult and usually impossible to execute a program in reverse, because many statements in a program, and therefore the computing processes, are irreversible [FeBr 89].

Another approach is to implement backtracking using re-execution, where the program’s statements are executed in their original order. The goal is the same as for reverse execution: to track the processes’ history backwards in time from the manifestation of the error (the observation point) to the point at which the bug initially caused the erroneous behavior [ChNe 91]. In order to detect an error, we need to investigate the history of the process and collect sufficient information. To achieve this goal, a breakpoint is positioned somewhere in the code before the observation point. Since it is a priori unknown whether the breakpoint is really adequate, it can only be positioned at a suspected error location. Afterwards the program is executed until this breakpoint is reached. At this suspected error location the process state can be analyzed and information about the program’s behavior can be gained. If this information is sufficient and the bug has been detected, the debugging cycle is stopped and the faulty lines of code can be corrected. Otherwise, the error may be located either before or after the selected breakpoint location, which defines the new suspected error location. If the current breakpoint is placed after the new suspected error location, the program has to be restarted and executed again up to the new breakpoint. On the contrary, if the current breakpoint is before the new suspected error location, it is usually easy to continue execution from the current position.
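The backtracking-with-re-execution cycle can be made concrete with a toy interpreter. Everything below — the statement list, the `run_until` helper and the expected values — is an invented example, not an existing debugger: the breakpoint is moved forward and the program re-executed from the start until the state first deviates from what is expected.

```python
def run_until(statements, breakpoint_index):
    """Re-execute the program from its initial state up to a breakpoint."""
    state = {}
    for stmt in statements[:breakpoint_index]:
        stmt(state)
    return state

# A tiny "program" over one variable x; the third statement contains the bug.
program = [
    lambda s: s.update(x=1),
    lambda s: s.update(x=s["x"] + 1),
    lambda s: s.update(x=99),            # bug: should have been x = 3
    lambda s: s.update(x=s["x"] * 2),
]

def locate_bug(program, expected):
    """Move the breakpoint and re-execute until the first deviating state."""
    for bp in range(1, len(program) + 1):
        if run_until(program, bp)["x"] != expected[bp - 1]:
            return bp - 1                # statement where the error originates
    return None

bug_at = locate_bug(program, expected=[1, 2, 3, 6])
```

The error manifests only at the end of the run, but repeated re-execution with a moving breakpoint pins it down to statement index 2, mirroring the cycle of Figure 2-6.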

The above techniques are used in several debugging tools, such as IGOR [FeBr 89], Spyder [AgSp 91], PPD [ChNe 91] and pSather [Mina 94]. IGOR [FeBr 89] is a tool that supports reversible execution by performing periodic checkpointing of memory pages or file blocks modified during program execution. Spyder [AgSp 91] incorporates backtracking as well as dynamic program slicing techniques. PPD [ChNe 91] also supports flowback analysis and tries to keep both execution-time and debug-time overhead low. The technique used is incremental tracing, where only a small amount of trace data is generated during program execution, but the logged data is supplemented during follow-up debugging cycles by detailed information according to the needs of the debugging user. Cyclic debugging and breakpointing are also used in a parallel debugger for the parallel object-oriented language pSather [Mina 94]. Breakpoints are set for threads, and threads are single-stepped, while the debugger has to “follow” a thread whenever it migrates to another processor.

The debugging techniques described above can be used for both sequential and parallel programs. Due to the complexity of parallel programs, however, the assistance of some additional techniques is required. In principle there are three main reasons why parallel debugging differs from its sequential counterpart [Kran 00a]:

• Increased complexity,

• Amount of debugging data,

• Additional anomalous effects

Due to the large number of processes involved, a parallel program is more complex than a sequential program. To debug a parallel program, the programmer has to work with many processes at the same time, so the process of receiving and analysing information is more difficult. For example, displaying and managing the inherent complexity of parallel programs is not easy [Panc 96].

The potentially huge amount of debugging data that has to be stored and analyzed is another problem. For massively parallel programs the trace data are usually very extensive, and the process of detecting errors is difficult because so many data are irrelevant. This is the so-called “maze effect” [Damo 94]. Although this effect is also possible in sequential programs, it is much more likely in parallel programs. Therefore, a debugging tool must allow programmers to concentrate on the essential details, instead of flooding the user with all the possible data [NeDa 96].

Synchronization of, and communication between, concurrent processes are also problems when debugging parallel programs. While the multiplication of sequential bugs may already be a major obstacle, the existence of anomalous effects due to concurrency makes parallel debugging even worse. These effects in parallel programs are due to the increased influence of the underlying parallel hardware system, such as processor speed, load, cache contents, cache conflicts, network throughput and network conflicts, etc. Cyclic debugging is well understood for sequential programs. However, this approach often fails with parallel programs, because the undesirable behavior may not reappear when the program is re-executed [McHe 89]. These problems and their solutions are discussed below.

2.3.1 Nondeterministic Behavior of Parallel Programs

Nondeterministic programs are contrasted with deterministic programs. We often see deterministic behavior in sequential programs. “Deterministic” is defined as follows:

Definition 2-6: Determinism [Enge 88] (used in [Kran 00a])

A program executed by a machine is called deterministic if there exists exactly one follow-up state after each programming statement. This means that the follow-up statement of any statement is always uniquely defined. In addition, every deterministic program must terminate.

This means that, given the same input and a particular instruction I_k in a deterministic program, the instruction that will be executed after I_k is the same in every execution. To debug deterministic programs, cyclic debugging is a solution. Unfortunately, some programs do not satisfy the definition of “deterministic”. These are called nondeterministic programs and are defined as follows:

Definition 2-7: Nondeterministic programs [Kran 00a]

A program is nondeterministic if - for a given input - there may be situations where an arbitrary programming statement is succeeded by one of two or more follow-up states. This freedom of choice may be determined by pure chance or by unawareness of the complete state of the execution environment.

Definition 2-8: Nondeterministic programs [NeMi 92b] (used in [Kran 00a])

Nondeterministic programs do not fully specify all possible execution sequences, but allow a degree of freedom in selecting subsequent program states. Consequently, a program is nondeterministic if successive executions of the same program may yield different results although the same input is provided.

The execution paths of a nondeterministic program may differ from execution to execution; even the final results may differ. Obviously, different execution paths yield different intermediate results. However, a nondeterministic program may still produce correct final results from different intermediate results. For example, suppose process P0 receives values from process P1 and process P2 and calculates their total. The order in which the values arrive from P1 and P2 may vary; the execution paths are thus different, but the final results are the same. If only the result of the program is considered, Definition 2-8 is valid. However, such a program may still contain errors that users do not see in normal cases. It may even operate correctly for long periods of time, until suddenly incorrect results are revealed. In the remainder of this thesis, we use Definition 2-7 for nondeterministic programs.

Thinking along similar lines, Ronsse et al. introduce two types of nondeterminism: external nondeterminism and internal nondeterminism [RoCh 00b]. External nondeterminism means that an application returns different results for repeated executions with the same input data, while internal nondeterminism means that repeated executions with the same input yield the same results, but the internal execution paths are different.

It is also important to distinguish between intended nondeterminism and unintended nondeterminism. For the sake of performance and/or efficiency the programmer may deliberately accept nondeterminism. In this case the programmer should be perfectly aware of the nondeterministic nature of his program (and the consequences). But in the case of unintended nondeterminism, which is very often a result of “programmer’s laziness”, the situation is much more dangerous. The programmer is not aware of the side effects of the nondeterminism he has introduced, and hence cannot consider the special requirements necessary for testing nondeterministic programs.

There are many factors that lead to nondeterministic behavior of a program. A random number generator is a simple example: obviously, a random generator gives different values in different executions. Another source is input values from system calls such as gettimeofday, getpid, etc.

All the above factors can be found in both sequential and parallel programs. Besides these sources, parallel programs have additional sources that never exist in sequential programs. In shared-memory environments, the order of actions accessing shared resources, such as “read” and “write” operations of multiple threads/processes, may differ due to timing. This affects the behavior of the reading thread/process dramatically, depending on whether it reads the old or the new value.

In message-passing programs, the source of nondeterministic behavior is the order of incoming messages together with the wild-card receive, which is supported in most communication libraries. If a wild-card is used in a receive operation, the process accepts a message from any source. An example of a wild-card is MPI_ANY_SOURCE in MPI programs [MPI 95]. A program becomes nondeterministic as soon as there is at least one wild-card receive in it at which at least two messages can be accepted. In such a case it is not predictable which message will be accepted at the receive.
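The effect of a wild-card receive can be illustrated by enumerating all possible delivery orders. This is a toy model of the situation, not MPI itself: with the same input, different arrival orders of the two messages yield different results at the receiver.

```python
import itertools

def receiver(arrival_order):
    """P1 performs two wild-card receives and subtracts the second value
    from the first, so its result depends purely on delivery order."""
    first, second = arrival_order
    return first - second

# Messages carrying 5 (from P0) and 3 (from P2) race to the same receive.
results = {receiver(order) for order in itertools.permutations((5, 3))}
```

With identical input, the program can end in either of two final results ({2, -2}), which is exactly the nondeterminism of Definition 2-7.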

Events representing actions that lead to nondeterministic behavior of a program, as mentioned above, are called nondeterministic events.

2.3.2 Race Conditions

The concept of a race condition is introduced in the context of nondeterminism where the source of nondeterminism is communication or synchronization. Race conditions can be found in both shared-memory and message-passing parallel programs.

Figure 2-7 Race in shared-memory programs


For shared-memory programs, it is the synchronization between processes or threads (or the lack of it) that leads to race conditions. Netzer and Miller have defined race conditions as follows:

Definition 2-9: Shared memory race condition [NeMi 92a]

In shared-memory parallel programs that use explicit synchronization, race conditions result when accesses to shared memory are not properly synchronized

Informally, a race exists between two events if they conflict, e.g. one reads from and the other writes to the same memory location, and their execution order depends on how the threads (or tasks) are scheduled. For example, Figure 2-7(a) and Figure 2-7(b) show two orders of accessing a variable a shared between processes P0 and P1. The values read by process P1 are different; they depend on the order of the “write” and “read” operations on processes P0 and P1. This kind of race is also called a “data race”. Depending on the ordering relations that may be observed during execution, Helmbold and McDowell introduced four disjoint kinds of race: concurrent races, general races, unordered races, and omission races (see details in [HeMc 96]). In addition, Netzer and Miller classified three different types of data races: actual, apparent, and feasible (see details in [NeMi 92a]).
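The data race of Figure 2-7 can be made concrete by enumerating the two possible interleavings of the conflicting accesses. The code below only simulates the two schedules (no real threads are involved), which is enough to show that the value read depends on scheduling alone.

```python
def interleaving(ops):
    """Execute a schedule of ('write', v) / ('read',) operations on a
    shared variable a and return what the reader observed."""
    a = 0                       # initial value of the shared variable
    observed = None
    for op in ops:
        if op[0] == "write":    # P0 writes the shared variable
            a = op[1]
        else:                   # P1 reads the shared variable
            observed = a
    return observed

# Figure 2-7(a): write before read; Figure 2-7(b): read before write.
case_a = interleaving([("write", 7), ("read",)])
case_b = interleaving([("read",), ("write", 7)])
```

The same two operations yield two different observed values (7 versus the old value 0), depending only on which order the scheduler chooses.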

Similarly to the shared-variable paradigm, race conditions also exist in message-passing programs due to communication and synchronization. In the message-passing model, processes interact with each other via messages. The order of incoming messages may change the intermediate states and even the final results of a process. This kind of race can be defined as follows:

Definition 2-10: Message-passing race condition

A race condition in message-passing programs occurs if two or more messages can arrive at a particular receive operation simultaneously and any one of them can be accepted

Figure 2-8 Race condition in message-passing programs

When the message accepted by a receive statement is undefined, the order in which messages are delivered may differ between executions; the program is thus nondeterministic. As mentioned above, a receive statement may accept any incoming message if it uses a wild-card receive. For example, two orders of received messages are presented in Figure 2-8. Messages m1 and m2 have arrived at process P1 before receive statement a is started. Statement a will accept whichever message comes first; there are thus two possible situations, shown in Figure 2-8(a) and Figure 2-8(b). Depending on the values attached to messages m1 and m2, the final results of process P1 may differ in the two cases.

Among the reasons for messages arriving in different orders in different executions are environmental factors such as processor speed, processor load, scheduling, etc., which affect the send time of a message. In addition, factors inherent in the communication medium, e.g. network contention and other nondeterminism in the network, affect the message transfer time. Therefore, if two send events produce messages to the same destination and there is no causal link between them, it is undefined which message reaches the destination first.

2.3.3 The Irreproducibility Effect

Nondeterministic behavior of parallel programs can cause serious problems during testing and debugging: subsequent executions of the program may not reproduce the original bug with the same input. This effect is called the irreproducibility effect [SnHo 88] or the non-repeatability effect [NeMi 92b].

The irreproducibility effect makes it difficult to use traditional debugging techniques that require repeated execution, such as cyclic debugging. It is quite a problem even to verify that a certain bug has been removed. In addition, the disappearance of certain bugs and the appearance of new bugs in different executions of the same program will confuse the programmer [Kran 00a].

Ways of coping with the irreproducibility effect are offered by record&replay mechanisms (see Section 2.4), which provide any number of equivalent re-executions based on a previously performed execution.

2.3.4 The Completeness Problem

Another problem in testing is the completeness problem. Due to the nondeterminism of parallel programs, there is more than one execution path or result for the same input. To ensure that the program is completely tested, it is thus necessary to test all possible execution paths. The problem is how to know all possible execution paths or results for a certain input. Unfortunately, obtaining all possible executions of nondeterministic parallel programs is difficult or even impossible [Kran 00a]. It is therefore difficult to ensure completeness in testing.

In contrast to the irreproducibility problem, only a few solutions address the completeness problem. Kranzlmüller et al. proposed a solution based on event manipulation, which makes it possible to modify the order of events on the basis of a previously observed execution [KrVo 95]. The order of events can be manipulated during artificial replay [Scha 00], so that different execution paths can be evaluated. If the manipulation is executed automatically, every possible execution of a nondeterministic program can be obtained. Another approach is macrostep debugging [Kacs 98], in which a tool automatically derives all possible executions incrementally and constructs the execution tree [Kacs 00].

2.3.5 The Probe Effect

In debugging, programs are often monitored and useful information is extracted during their execution for later examination. It is therefore easy to violate the Heisenberg principle (see Section 2.2.2), which requires that the behavior of the program must not be changed by this additional work. This principle is connected with the so-called “probe effect” [McHe 89].

Definition 2-11: Probe effect [McHe 89]

Probe effect refers to the fact that any attempt to observe the behavior of a distributed system (or a parallel program) may change the behavior of that system (or program)

In debugging tools, the instrumentation code inserted into the program causes a certain amount of delay and occupies a certain amount of memory. This is known as monitor overhead. A primary goal of any debugging tool must be to generate very little overhead. Nevertheless, the instrumentation always affects the behavior of the target program and modifies the times at which events occur. Once the behavior of the program is changed, a particular error may disappear. Furthermore, other errors may appear; these are also called “Heisenbugs” [LePa 85] [RoCh 00b].


Avoiding the probe effect is an important aspect of debugging. It has therefore been discussed and addressed extensively, and plenty of literature exists on it [MaWi 92] [SaMa 93] [GaCa 94] [Mail 95]. However, there are still problems with the nondeterministic behavior of parallel programs. Consequently, Leu and Schiper [LeSc 92], and later Teodorescu and Chassin de Kergommeaux [TeCh 97], introduced a new solution in which only a minimum of trace data is required to allow re-execution of programs. The initial execution is thus assumed to be only slightly disturbed, and afterwards the replayed executions are used to collect more information. Using a similar approach, Kranzlmüller et al. collect the correct event occurrence timings in the initial execution and replay the program with clock synchronization algorithms [KrSc 99] [ScVo 99].

Another class of solutions is designed to prevent the probe effect. The first approach in this class uses a “logical clock”. Real time is not used, because it is affected by the monitoring work. Instead of real time, Cai and Turner use a virtual time to reflect the real-time execution of each process when running without monitoring [CaTu 94]. Another solution helps to correct the monitor intrusion with artificial delays, so that the relative speeds of all the processes remain the same [LiZh 95] [ZhLi 98]. Similarly, Wu et al. distinguish between various kinds of monitoring intrusion, such as communication intrusion, scheduling intrusion and execution intrusion [WuGu 96], and propose a method of distinguishing monitoring activities from other computational tasks in order to allow automatic removal of the monitoring intrusion [WuGu 96] [WuSp 98]. With the aim of keeping the program’s behavior nearly unaffected by intrusion, Hollingsworth and Miller [HoMi 96] constructed a data collection cost system that provides users with feedback about the impact of the data collection on the running program. In addition, users can define the perturbation their application can tolerate, and the system then regulates the amount of instrumentation so that the monitor overhead does not exceed the defined threshold.

2.4 Record&Replay Mechanisms

Due to the nondeterministic behavior of parallel programs, using cyclic debugging tools on them is a challenge: if there is no guarantee that a re-execution will produce the same error as the initial execution, cyclic debugging is useless. A solution to this problem is provided by “record&replay” mechanisms or “replay” methods. This is a two-step approach. The first step is a record phase, in which data related to nondeterministic events are stored in trace files. Afterwards, these trace data are used as a constraint on the program in the second step, called the replay phase, to produce any number of equivalent executions.

There are two ways to force executions to be equivalent to the initially traced execution. The first is data-driven [Kran 00a] or content-based (data-based) [RoCh 00b] replay, in which processes are forced to read the same values of shared variables, or to receive the same messages, as in the initial execution by recording the original values. Data-driven replay methods were introduced early on. Curtis and Wittie [CuWi 82] proposed a technique to record the contents of each message when it is received by the corresponding process. Furthermore, Pan and Linton [PaLi 89] introduced a content-based replay method for shared-memory parallel programs, in which the data read from every shared-memory location are traced. The advantage of this technique is that a single process can be replayed in isolation. However, its main drawback is the significant monitor and storage overhead it requires. Furthermore, it does not show the interactions between the different processes, and thus it hinders the task of finding the cause of a bug [LeMe 87]. Nevertheless, it is still used for tracing I/O or for tracing the results of certain system calls such as gettimeofday() and random().
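Content-based replay of a nondeterministic call such as gettimeofday() can be sketched as follows. This is a simplified illustration with an invented class name, assuming only that the tracer can interpose on the call: during recording the return value is logged, and during replay the logged value is returned instead of performing the real call.

```python
import time

class ContentReplay:
    """Data-driven (content-based) record&replay of a nondeterministic call."""
    def __init__(self, log=None):
        self.log = log            # None => record mode; a list => replay mode
        self.recorded = []

    def gettimeofday(self):
        if self.log is not None:
            return self.log.pop(0)    # replay: feed back the logged value
        t = time.time()               # record: perform the real call
        self.recorded.append(t)
        return t

rec = ContentReplay()
t1 = rec.gettimeofday()                      # initial (recorded) execution
rep = ContentReplay(log=list(rec.recorded))
t2 = rep.gettimeofday()                      # replay sees the same value
```

The replayed execution observes exactly the value of the initial execution, even though the real clock has moved on in the meantime.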

The second replay technique is control-driven [Kran 00a] or ordering-based [RoCh 00b]. This technique is based on the piecewise deterministic (PWD) execution model [StYe 85]: process execution is divided into a sequence of state intervals, each of which is started by a nondeterministic event. Execution within an interval is completely deterministic. Under the PWD assumption, the execution of each deterministic interval depends only on the sequence of nondeterministic events that preceded the interval’s beginning. Equivalent executions are thus ensured if the ordering of nondeterministic operations on all processes is the same as in the initial execution.
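Under the PWD assumption only the order of nondeterministic events needs to be logged. The sketch below is an invented mini-model of a single process with wild-card receives, not any particular replay system: during recording, the message accepted at each receive is logged; during replay, the same acceptance order is enforced, yielding an equivalent execution.

```python
import random

def execute(arriving, order_log=None, record_into=None):
    """One process draining its pending messages; the acceptance order is
    the only nondeterminism (control-driven / ordering-based replay)."""
    pending = list(arriving)
    received = []
    while pending:
        if order_log is not None:
            msg = order_log.pop(0)         # replay: forced acceptance order
            pending.remove(msg)
        else:
            msg = pending.pop(random.randrange(len(pending)))  # record phase
            if record_into is not None:
                record_into.append(msg)    # log only the order, not the data
        received.append(msg)
    return received

log = []
first = execute([10, 20, 30], record_into=log)       # record phase
second = execute([10, 20, 30], order_log=list(log))  # replay phase
```

Whatever order the record phase happens to choose, the replay phase reproduces it exactly; only the ordering information is stored, which is why control-driven replay needs far less trace data than content-based replay.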


An example of a control-driven technique is Instant Replay by LeBlanc and Mellor-Crummey [LeMe 87], which can be applied to both shared-memory and message-passing programs. Following the PWD execution model, if each process is supplied with the same input values, such as the contents of messages received or the values of shared-memory locations referenced, in the same order during successive executions, it will produce the same behavior each time. Consequently each process will produce the same output values in the same order. These output values may then serve as input values for other processes. Therefore we need only trace the relative order of significant events instead of the data associated with these events. The advantage of this technique is that it requires less time and space to save the information needed for replay.

The control-driven technique is implemented in several record&replay methods for both shared-memory and message-passing programs. In a shared-memory environment, an approach called the repeatable scheduling algorithm [RuCo 96] provides deterministic replay on shared-memory uniprocessor systems by forcing the system to make the same scheduling decisions during re-executions as during the initial execution. In a message-passing environment, several solutions are implemented for PVM programs [Mack 93] [XiSh 96], for MPI programs [ClRu 95] [HoCh 96] [ChBo 99], and for a hybrid parallel/distributed system [RoCh 00a]. This technique is used in many debugging tools, such as Panorama [MaBe 93], IVD [Mack 93], PDT [ClRu 95], Buster [XiSh 96], DDB [SiRa 97], and DCDB [ZhHu 00].

To improve the control-driven replay technique, Netzer and Miller [NeMi 92b] proposed an optimal tracing and replay method. They try to minimize the amount of trace data needed, both for message-passing systems [NeMi 92b] and for shared-memory programs [Netz 93]. The key idea is that only those events affecting race conditions have to be traced. Based on this idea, there have been several efforts to improve the technique further, such as improvements in time efficiency [FaCh 95] [FaCh 96] and in space efficiency [LeSc 92] [LeSc 93]. To reduce the space needed for logging data, compression algorithms are used [RoLe 95]. In addition, Ronsse and Kranzlmüller proposed a technique to replay message-passing programs with a minimum of trace data and less intrusively [RoKe 98].

Besides the Instant Replay method, several other replay methods have been developed. Carver and Tai [TaOb 91] [CaTa 91] introduced Deterministic Execution. This solution is applied to both semaphore-based and monitor-based programs, where a tool helps to track down the sequence of synchronization events; by inserting these synchronization constructs into the replayed execution, the irreproducibility effect can be eliminated. In order to minimize the probe effect in debugging, on-the-fly replay [GeZa 94] was introduced. Instead of collecting information in a record phase, a twin execution is created, in which one execution gets information from the other to ensure equivalence. In this solution the record and replay phases are executed simultaneously.

The Incremental Replay proposed by Netzer and Xu [NeXu 93] is another approach. This replay technique uses checkpointing to allow programmers to replay any part (or interval) of a process immediately. Checkpointing improves time efficiency: programmers do not have to wait for the process to be re-run from the initial state, and thus the waiting time is reduced. In addition, it also bounds the time taken by the replay process [NeWe 94], which means that the replay time of an interval does not exceed an upper bound. Cyclic debugging is more useful when using Incremental Replay. However, Incremental Replay only allows one interval of one process to be replayed at a time. Each checkpoint can be seen as a breakpoint: programmers can reach any breakpoint on a process immediately, but they cannot examine the interactions between processes. In other words, Incremental Replay rules out the use of distributed breakpoints. This problem is solved by our replay method, which supports distributed breakpoints both for all processes and for groups of processes. Furthermore, the waiting time to reach any distributed breakpoint is short.

In this chapter, the need for high-performance computing has been described. Due to many large-scale applications in science and economics, we need ever more powerful supercomputers. In addition, developing software to run on these supercomputers is important and necessary; one aspect of this is the testing and debugging process.

Cyclic debugging, a useful traditional debugging method, has been introduced, and we are interested in the possibility of using this method for parallel programs. Therefore several ways to set breakpoints in parallel programs have been examined. In addition, other problems arising with parallel programs, e.g. nondeterministic behavior, irreproducibility, and the probe effect, have been discussed.

Due to the nondeterministic behavior of parallel programs, record&replay methods have been proposed; hence, a survey of record&replay methods has been given. We are especially interested in applying these methods to long-running, large-scale parallel programs. Unfortunately, no current record&replay method ensures that any distributed breakpoint in a parallel program can be reached with minimal waiting time. This is a hindrance to developing useful debugging tools, where some degree of interactivity for the user’s investigations is required. This problem will be solved in this thesis.


Chapter 3 Event Graph Model

This chapter introduces a program model, the so-called event graph model. In debugging we need to investigate program states, and the definition of an event is thus introduced. We also discuss what constitutes an event and how its data are generated. The event graph model is constructed on the basis of these notions about events. This graph model represents not only the set of event vertices but also (and this is important) the relations between them.

Kranzlmüller proposed the event graph model in [Kran 00a]. In this dissertation, some new kinds of events, e.g. synchronous events and checkpoint events, are introduced.


3 Event Graph Model

3.1 Events

In debugging, the behavior of a program is analysed to detect errors. In this context, behavior identifies all the activities that have observable effects in an executing system [Bate 95]. To analyse this behavior, the states of a program and their relations as they are established during program execution are investigated [Kran 00a]. The results of this investigation may enable us to find the locations of errors.

The state of a process consists of two parts: the control substate, which identifies the current point of execution, and the data substate, which contains all data that are occupied by the process [Plat 84]2. The program state of the program P is a set that includes all potential states of processes derived from P [Plat 84]. The state of a program changes during execution, and it is the main information for debugging. For example, when a process receives a message or performs a calculation, the state of the process (and the state of the program) is changed, due to the new value in its memory. Obviously, the number of states of a process (or of a program) is then very large, especially with long-running programs, because the execution of any instruction of the program may produce a new state. Therefore, the idea of storing all states of a process (or a program) and letting users navigate through this data space backwards and forwards in time is impossible to implement. However, users may neither be interested in nor need all states of a process (or a program), nor all contents of a process state (or a program state), during debugging. This observation makes it possible to reduce the contents of a process state (and a program state). In addition, only the states of interest are stored.
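The two substates can be pictured with a minimal, hypothetical record (the names `ProcessState`, `program_counter`, and `variables` are illustrative, not from the thesis): the control substate is where the process currently is, and the data substate is what it currently holds.

```python
from dataclasses import dataclass, field

@dataclass
class ProcessState:
    """Hypothetical state record: control substate (current point of
    execution) plus data substate (data occupied by the process)."""
    program_counter: int                            # control substate
    variables: dict = field(default_factory=dict)   # data substate

s = ProcessState(program_counter=42, variables={"x": 1, "buf": [0, 0]})
# Executing a single instruction already yields a new state -- which is
# why storing *every* state of a long-running program is infeasible.
s.program_counter += 1
s.variables["x"] = 2
```

Even this toy record makes the explosion visible: one new `ProcessState` per executed instruction, multiplied by every process of the program.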

There are two ways to collect the states of a process (or a program) discretely: data are collected either at predefined intervals or whenever certain actions are observed. The former is often called profiling and is used in performance analysis. The state of a process (or a program) is inspected whenever a certain amount of time has passed or a routine has executed completely. The information about the state of a process (or a program) can be used to tune the program's performance.
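Interval-based collection could be sketched as follows (a simplified, hypothetical profiler; the interval value and workload are assumed): the state is inspected after every fixed number of actions, regardless of what those actions are.

```python
SAMPLE_INTERVAL = 100  # inspect the state every 100 actions (assumed value)
samples = []

def profiled_run(steps):
    """Toy workload whose state is sampled at predefined intervals,
    in contrast to event-based tracing."""
    state = 0
    for i in range(1, steps + 1):
        state += i                       # stands in for program work
        if i % SAMPLE_INTERVAL == 0:
            samples.append((i, state))   # periodic state inspection
    return state

profiled_run(250)
print(len(samples))  # two samples were taken: at steps 100 and 200
```

The contrast with the debugging approach described next is that here the trigger is elapsed work (or time), not the occurrence of a specific action.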

In debugging, another approach is used: the state data from a program are collected whenever a predefined action is taken. This method is often called tracing, recording, or logging. Generally, it is called “monitoring”.

Definition 3-1: Monitoring [TsYa 95]

Monitoring is the recording of specified event occurrences during program execution in order to gain runtime information that cannot be obtained merely by studying the program text.

The event occurrences to be monitored are called the events of interest or the specified events. These events often occur discretely. The monitoring facility is inactive until an event of interest occurs; at this point the required data are traced to serve subsequent analysis.
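Such event-based monitoring can be sketched as below (hypothetical names and event set; an actual monitor for message-passing programs would instrument the communication library): the monitor records an occurrence only when it belongs to the set of specified events and stays inactive otherwise.

```python
import time

TRACE = []
EVENTS_OF_INTEREST = {"send", "recv"}  # the specified events (assumed set)

def monitor(event_type, **data):
    """Record an occurrence only if it is an event of interest."""
    if event_type in EVENTS_OF_INTEREST:
        TRACE.append({"type": event_type, "time": time.time(), **data})

# Instrumented program actions:
monitor("send", dest=1, tag=7)   # traced
monitor("compute", n=10**6)      # ignored -- not a specified event
monitor("recv", src=1, tag=7)    # traced

print([e["type"] for e in TRACE])  # -> ['send', 'recv']
```

The recorded occurrences in `TRACE` are exactly the raw material from which the event graph of the next sections is built.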

Events are the basic components in the event graph model. In principle an event is something that occurs during execution, and thus anything can be defined as an event [Kran 00a]. We use the following event definition:

2 This state of a process differs from the definition of process state used in operating systems (ready, running, waiting, etc.).
