
Fig. 4.9 Illustration of the parameters of the LogP model: P processors connected by an interconnection network; a message transfer incurs the overhead o at the sender, the network latency L, and the overhead o at the receiver.

Figure 4.9 illustrates the meaning of these parameters [33]. All parameters except P are measured in time units or as multiples of the machine cycle time. Furthermore, it is assumed that the network has a finite capacity, which means that between any pair of processors at most ⌈L/g⌉ messages are allowed to be in transmission at any time. If a processor tries to send a message that would exceed this limit, it is blocked until the message can be transmitted without exceeding the limit. The LogP model assumes that the processors exchange small messages that do not exceed a predefined size. Larger messages must be split into several smaller messages. The processors work asynchronously with each other. The latency of any single message cannot be predicted in advance, but it is bounded by L if there is no blocking because of the finite capacity. In particular, messages do not necessarily arrive in the same order in which they have been sent. The values of the parameters L, o, and g depend not only on the hardware characteristics of the network, but also on the communication library and its implementation.

The execution time of an algorithm in the LogP model is determined by the maximum of the execution times of the participating processors. An access by a processor P1 to a data element that is stored in the local memory of another processor P2 takes time 2 · L + 4 · o; half of this time is needed to send the access request from P1 to P2, the other half is needed to bring the data element from P2 back to P1. A sequence of n messages can be transmitted in time L + 2 · o + (n − 1) · g, see Fig. 4.10.
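As a small illustration (a sketch, not from the book; the parameter values below are arbitrary placeholders, not measured values), these two cost formulas can be written down directly in C:

#include <stdio.h>

/* LogP estimate for a remote read access: a request message and a reply,
   together costing 2*L + 4*o. */
double logp_remote_access(double L, double o) {
    return 2.0 * L + 4.0 * o;
}

/* LogP estimate for a sequence of n small messages: the last message is
   injected at time (n-1)*g and arrives 2*o + L time units later. */
double logp_message_sequence(int n, double L, double o, double g) {
    return L + 2.0 * o + (n - 1) * g;
}

int main(void) {
    double L = 5.0, o = 1.0, g = 0.5;   /* placeholder values in microseconds */
    printf("remote access: %.1f us\n", logp_remote_access(L, o));
    printf("100 messages:  %.1f us\n", logp_message_sequence(100, L, o, g));
    return 0;
}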

A drawback of the original LogP model is that it is based on the assumption that the messages are small and that only point-to-point messages are allowed. More complex communication patterns must be assembled from point-to-point messages.

Fig. 4.10 Transmission of a larger message as a sequence of n smaller messages in the LogP model. The transmission of the last smaller message is started at time (n − 1) · g and reaches its destination 2 · o + L time units later.


Fig. 4.11 Illustration of the transmission of a message with n bytes in the LogGP model. The transmission of the last byte of the message is started at time o + (n − 1) · G and reaches its destination o + L time units later. Between the transmission of the last byte of a message and the start of the transmission of the next message at least g time units must have elapsed.

To lift the restriction to small messages, the LogP model has been extended to the LogGP model [10], which contains an additional parameter G (gap per byte). This parameter specifies the transmission time per byte for long messages; 1/G is the bandwidth available per processor. The transmission of a message with n bytes takes time o + (n − 1) · G + L + o, see Fig. 4.11.
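As a small illustration (again a sketch with made-up parameter values, not taken from the book), the LogGP transmission time of an n-byte message follows directly from this formula:

#include <stdio.h>

/* LogGP estimate for transmitting a message of n bytes: send overhead o,
   (n-1) per-byte gaps G, network latency L, and receive overhead o. */
double loggp_message_time(long n, double L, double o, double G) {
    return o + (n - 1) * G + L + o;
}

int main(void) {
    double L = 5.0, o = 1.0, G = 0.01;   /* placeholder values in microseconds */
    printf("1 KB message: %.2f us\n", loggp_message_time(1024, L, o, G));
    return 0;
}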

The LogGP model has been successfully used to analyze the performance of message-passing programs [9, 104]. It has been further extended to the LogGPS model [96] by adding a parameter S to capture the synchronization that must be performed when sending large messages. The parameter S is the threshold for the message length above which a synchronization between sender and receiver is performed before the message transmission starts.

4.6 Exercises for Chap. 4

Exercise 4.1 We consider two processors P1 and P2 which have the same set of instructions. P1 has a clock rate of 4 GHz, P2 has a clock rate of 2 GHz. The instructions of the processors can be partitioned into three classes A, B, and C. The following table specifies for each class the CPI values for both processors. We assume that there are three compilers C1, C2, and C3 available for both processors. We consider a specific program X. All three compilers generate machine programs which lead to the execution of the same number of instructions, but the instruction classes are represented with different proportions according to the following table:

Class CPI for P1 CPI for P2 C1 (%) C2 (%) C3 (%)

(a) If C1 is used for both processors, how much faster is P1 than P2?

(b) If C2 is used for both processors, how much faster is P2 than P1?


(c) Which of the three compilers is best for P1?

(d) Which of the three compilers is best for P2?

Exercise 4.2 Consider the MIPS (Million Instructions Per Second) rate for estimating the performance of computer systems for a computer with instructions I1, . . . , Im. Let pk be the proportion with which instruction Ik (1 ≤ k ≤ m) is represented in the machine program for a specific program X with 0 ≤ pk ≤ 1. Let CPIk be the CPI value for Ik and let tc be the cycle time of the computer system in nanoseconds (10⁻⁹ s).

(a) Show that the MIPS rate for program X can be expressed as

MIPS(X) = 1000 / ((p1 · CPI1 + · · · + pm · CPIm) · tc [ns]).

(b) Consider a computer with a clock rate of 3.3 GHz. The CPI values and the proportion of occurrence of the different instructions for program X are given in the following table:

Integer add and subtract 18.0 1

Integer multiply and divide 10.7 9

Floating-point add and subtract 3.5 7

Floating-point multiply and divide 4.6 17

Compute the resulting MIPS rate for program X.
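As a quick illustration of how the expression from part (a) can be evaluated (a sketch, not from the book): the 3.3 GHz clock rate is taken from part (b), while the proportions and CPI values below are made up for illustration and are not the values from the table above.

#include <stdio.h>

int main(void) {
    /* Hypothetical instruction mix: proportions p_k (summing to 1) and CPI values. */
    double p[]   = {0.5, 0.2, 0.3};
    double cpi[] = {1.0, 4.0, 2.0};
    double tc_ns = 1.0 / 3.3;            /* cycle time in ns for a 3.3 GHz clock */

    double avg_cpi = 0.0;
    for (int k = 0; k < 3; k++)
        avg_cpi += p[k] * cpi[k];

    /* MIPS(X) = 1000 / (avg_cpi * t_c[ns]) */
    double mips = 1000.0 / (avg_cpi * tc_ns);
    printf("MIPS rate: %.1f\n", mips);
    return 0;
}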

Exercise 4.3 There is a SPEC benchmark suite MPI2007 for evaluating the MPI performance of parallel systems for floating-point, compute-intensive programs. Visit the SPEC web page at www.spec.org and collect information on the benchmark programs included in the benchmark suite. Write a short summary for each of the benchmarks covering the computations performed, the programming language used, the MPI usage, and the input description. What criteria were used to select the benchmarks? Which information is obtained by running the benchmarks?

Exercise 4.4 There is a SPEC benchmark suite to evaluate the performance of parallel systems with a shared address space based on OpenMP applications. Visit the SPEC web page at www.spec.org and collect information about this benchmark suite. Which applications are included and what information is obtained by running the benchmark?

Exercise 4.5 The SPEC CPU2006 is the standard benchmark suite to evaluate the performance of computer systems. Visit the SPEC web page at www.spec.org and collect the following information:


(a) Which benchmark programs are used in CINT2006 to evaluate the integer performance? Give a short characterization of each of the benchmarks.

(b) Which benchmark programs are used in CFP2006 to evaluate the floating-point performance? Give a short characterization of each of the benchmarks.

(c) Which performance results have been submitted for your favorite desktop computer?

Exercise 4.6 Consider a ring topology and assume that each processor can transmit at most one message at any time along an incoming or outgoing link (one-port communication). Show that the running time for a single-broadcast, a scatter operation, or a multi-broadcast takes time Θ(p). Show that a total exchange needs time Θ(p²).

Exercise 4.7 Give an algorithm for a scatter operation on a linear array which sends the messages for the more distant nodes first from the root node, and determine the asymptotic running time.

Exercise 4.8 Given a two-dimensional mesh with wraparound connections forming a torus consisting of n × n nodes, construct spanning trees for a multi-broadcast operation according to the construction in Sect. 4.3.2.2, p. 174, and give a corresponding algorithm for the communication operation which takes time (n² − 1)/4 for n odd and n²/4 for n even [19].

Exercise 4.9 Consider a d-dimensional mesh network with p^(1/d) processors in each of the d dimensions. Show that a multi-broadcast operation requires at least (p − 1)/d steps to be implemented. Construct an algorithm for the implementation of a multi-broadcast that performs the operation with this number of steps.

Exercise 4.10 Consider the construction of a spanning tree in Sect. 4.3.2, p. 173, and Fig. 4.4. Use this construction to determine the spanning tree for a five-dimensional hypercube network.

Exercise 4.11 For the construction of the spanning trees for the realization of a multi-broadcast operation on a d-dimensional hypercube network, we have used the relation

\binom{d}{k−1} − d ≥ d

for 2 < k < d and d ≥ 5, see Sect. 4.3.2, p. 180. Show by induction that this relation is true. (Hint: It is \binom{d}{k−1} = \binom{d−1}{k−1} + \binom{d−1}{k−2}.)



Exercise 4.12 Consider a complete binary tree with p processors [19].

(a) Show that a single-broadcast operation takes time Θ(log p).
(b) Give an algorithm for a scatter operation with time Θ(p). (Hint: Send the more distant messages first.)
(c) Show that an optimal algorithm for a multi-broadcast operation takes p − 1 time steps.


(d) Show that a total exchange needs at least time Ω(p²). (Hint: Count the number of messages that must be transmitted along the incoming links of a node.)
(e) Show that a total exchange needs at most time O(p²). (Hint: Use an embedding of a ring topology into the tree.)

Exercise 4.13 Consider a scalar product and a matrix–vector multiplication and derive the formula for the running time on a mesh topology.

Exercise 4.14 Develop a runtime function to capture the execution time of a parallel matrix–matrix computation C = A · B for a distributed address space. Assume a hypercube network as interconnection. Consider the following distributions for A and B:

(a) A is distributed in column-blockwise, B in row-blockwise order.
(b) Both A and B are distributed in checkerboard order.

Compare the resulting runtime functions and try to identify situations in which one or the other distribution results in a faster parallel program.

Exercise 4.15 The multi-prefix operation leads to the effect that each participating processor P_j obtains the value σ + ∑_{i=1}^{j−1} σ_i, where processor P_i contributes the value σ_i and σ is the initial value of the memory location used, see also p. 188. Illustrate the effect of a multi-prefix operation with an exchange diagram similar to those used in Sect. 3.5.2. The effect of multi-prefix operations can be used for the implementation of parallel loops where each processor gets iterations to be executed. Explain this usage in more detail.


Chapter 5

Message-Passing Programming

The message-passing programming model is based on the abstraction of a parallel computer with a distributed address space where each processor has a local memory to which it has exclusive access, see Sect. 2.3.1. There is no global memory. Data exchange must be performed by message-passing: To transfer data from the local memory of one processor A to the local memory of another processor B, A must send a message containing the data to B, and B must receive the data in a buffer in its local memory. To guarantee portability of programs, no assumptions on the topology of the interconnection network are made. Instead, it is assumed that each processor can send a message to any other processor.

A message-passing program is executed by a set of processes where each process has its own local data. Usually, one process is executed on one processor or core of the execution platform. The number of processes is often fixed when starting the program. Each process can access its local data and can exchange information and data with other processes by sending and receiving messages. In principle, each of the processes could execute a different program (MPMD, multiple program multiple data). But to make program design easier, it is usually assumed that each of the processes executes the same program (SPMD, single program multiple data), see also Sect. 2.2. In practice, this is not really a restriction, since each process can still execute different parts of the program, selected, for example, by its process rank.

The processes executing a message-passing program can exchange local data by using communication operations. These could be provided by a communication library. To activate a specific communication operation, the participating processes call the corresponding communication function provided by the library. In the simplest case, this could be a point-to-point transfer of data from a process A to a process B. In this case, A calls a send operation, and B calls a corresponding receive operation. Communication libraries often provide a large set of communication functions to support different point-to-point transfers and also global communication operations like broadcast in which more than two processes are involved, see Sect. 3.5.2 for a typical set of global communication operations.

A communication library could be vendor or hardware specific, but in most cases portable libraries are used, which define the syntax and semantics of communication functions and which are supported for a large class of parallel computers.

By far the most popular portable communication library is MPI (Message-Passing Interface) [55, 56], but PVM (Parallel Virtual Machine) is also often used, see [63]. In this chapter, we give an introduction to MPI and show how parallel programs with MPI can be developed. The description includes point-to-point and global communication operations; more advanced features like process groups and communicators are also covered.

5.1 Introduction to MPI

The Message-Passing Interface (MPI) is a standardization of a message-passing library interface specification. MPI defines the syntax and semantics of library routines for standard communication patterns as they have been considered in Sect. 3.5.2. Language bindings for C, C++, Fortran-77, and Fortran-95 are supported. In the following, we concentrate on the interface for C and describe the most important features. For a detailed description, we refer to the official MPI documents, see www.mpi-forum.org. There are two versions of the MPI standard: MPI-1 defines standard communication operations and is based on a static process model. MPI-2 extends MPI-1 and provides additional support for dynamic process management, one-sided communication, and parallel I/O. MPI is an interface specification for the syntax and semantics of communication operations, but leaves the details of the implementation open. Thus, different MPI libraries can use different implementations, possibly using specific optimizations for specific hardware platforms. For the programmer, MPI provides a standard interface, thus ensuring the portability of MPI programs. Freely available MPI libraries are MPICH (see www-unix.mcs.anl.gov/mpi/mpich2), LAM/MPI (see www.lam-mpi.org), and OpenMPI (see www.open-mpi.org).

In this section, we give an overview of MPI according to [55, 56]. An MPI program consists of a collection of processes that can exchange messages. For MPI-1, a static process model is used, which means that the number of processes is set when starting the MPI program and cannot be changed during program execution. Thus, MPI-1 does not support dynamic process creation during program execution. Such a feature is added by MPI-2. Normally, each processor of a parallel system executes one MPI process, and the number of MPI processes started should be adapted to the number of processors that are available. Typically, all MPI processes execute the same program in an SPMD style. In principle, each process can read and write data from/into files. For a coordinated I/O behavior, it is essential that only one specific process perform the input or output operations. To support portability, MPI programs should be written for an arbitrary number of processes. The actual number of processes used for a specific program execution is set when starting the program.

On many parallel systems, an MPI program can be started from the command line. The following two commands are commonly used:

mpiexec -n 4 programname programarguments

mpirun -np 4 programname programarguments


This call starts the MPI program programname with p = 4 processes. The specific command to start an MPI program on a parallel system can differ.

A significant part of the operations provided by MPI consists of operations for the exchange of data between processes. In the following, we describe the most important MPI operations. For a more detailed description of all MPI operations, we refer to [135, 162, 163]. In particular, the official description of the MPI standard provides many more details that cannot be covered in our short description, see [56]. Most examples given in this chapter are taken from these sources. Before describing the individual MPI operations, we first introduce some semantic terms that are used for the description of MPI operations:

• Blocking operation: An MPI communication operation is blocking if return of control to the calling process indicates that all resources, such as buffers, specified in the call can be reused, e.g., for other operations. In particular, all state transitions initiated by a blocking operation are completed before control returns to the calling process.

• Non-blocking operation: An MPI communication operation is non-blocking if the corresponding call may return before all effects of the operation are completed and before the resources used by the call can be reused. Thus, a call of a non-blocking operation only starts the operation. The operation itself is completed only when all state transitions caused by it have been completed and the resources specified can be reused; see the code sketch after this list.

The terms blocking and non-blocking describe the behavior of operations from the local view of the executing process, without taking the effects on other processes into account. But it is also useful to consider the effect of communication operations from a global viewpoint. In this context, it is reasonable to distinguish between synchronous and asynchronous communication:

• Synchronous communication: The communication between a sending process and a receiving process is performed such that the communication operation does not complete before both processes have started their communication operations. This means in particular that the completion of a synchronous send indicates not only that the send buffer can be reused, but also that the receiving process has started the execution of the corresponding receive operation.

• Asynchronous communication: Using asynchronous communication, the

send-er can execute its communication opsend-eration without any coordination with the receiving process
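The following minimal sketch (not from the book) illustrates the blocking/non-blocking distinction from the local view of the calling process. It uses MPI_Send, which is introduced in the next section, and MPI_Isend and MPI_Wait, which are standard non-blocking MPI operations not described in this excerpt; it also assumes that a process with rank 1 exists in MPI_COMM_WORLD.

#include <mpi.h>

/* Sketch: blocking vs. non-blocking send from the local view of the sender.
   Error handling is omitted; a process with rank 1 is assumed to exist. */
void send_example(double *a, double *b)
{
    MPI_Request request;
    MPI_Status status;

    /* Blocking send: when control returns, buffer a may be reused. */
    MPI_Send(a, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);

    /* Non-blocking send: the call only starts the operation; buffer b must
       not be reused before MPI_Wait indicates that the operation is completed. */
    MPI_Isend(b, 100, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD, &request);
    MPI_Wait(&request, &status);
}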

In the next section, we consider the single transfer operations provided by MPI, which are also called point-to-point communication operations.

5.1.1 MPI Point-to-Point Communication

In MPI, all communication operations are executed using a communicator. A communicator represents a communication domain which is essentially a set of processes that exchange messages between each other. In this section, we assume that the MPI default communicator MPI_COMM_WORLD is used for the communication. This communicator captures all processes executing a parallel program. In Sect. 5.3, the grouping of processes and the corresponding communicators are considered in more detail.

The most basic form of data exchange between processes is provided by point-to-point communication. Two processes participate in this communication operation: A sending process executes a send operation and a receiving process executes a corresponding receive operation. The send operation is blocking and has the syntax:

int MPI_Send(void *smessage,
             int count,
             MPI_Datatype datatype,
             int dest,
             int tag,
             MPI_Comm comm)


The parameters have the following meaning:

• smessage specifies a send buffer which contains the data elements to be sent in successive order;
• count is the number of elements to be sent from the send buffer;
• datatype is the data type of each entry of the send buffer; all entries have the same data type;
• dest specifies the rank of the target process which should receive the data; each process of a communicator has a unique rank; the ranks are numbered from 0 to the number of processes minus one;
• tag is a message tag which can be used by the receiver to distinguish different messages from the same sender;
• comm specifies the communicator used for the communication.

The size of the message in bytes can be computed by multiplying the number count of entries by the number of bytes used for type datatype. The tag parameter should be an integer value between 0 and 32,767. Larger values can be permitted by specific MPI libraries.
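As a small aside (a sketch, not from the book), the number of bytes per entry does not have to be hard-coded; it can be queried with the standard MPI routine MPI_Type_size:

#include <mpi.h>

/* Sketch: compute the size in bytes of a message with count entries of the
   given MPI data type by querying the number of bytes per entry. */
int message_size_in_bytes(int count, MPI_Datatype datatype)
{
    int type_size;
    MPI_Type_size(datatype, &type_size);   /* bytes per entry */
    return count * type_size;
}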

To receive a message, a process executes the following operation:

int MPI_Recv(void *rmessage,
             int count,
             MPI_Datatype datatype,
             int source,
             int tag,
             MPI_Comm comm,
             MPI_Status *status)

This operation is also blocking. The parameters have the following meaning:


• rmessage specifies the receive buffer in which the message should be stored;
• count is the maximum number of elements that should be received;
• datatype is the data type of the elements to be received;
• source specifies the rank of the sending process which sends the message;
• tag is the message tag that the message to be received must have;
• comm is the communicator used for the communication;
• status specifies a data structure which contains information about a message after the completion of the receive operation.

The predefined MPI data types and the corresponding C data types are shown in Table 5.1. There is no corresponding C data type for MPI_PACKED and MPI_BYTE. The type MPI_BYTE represents a single byte value. The type MPI_PACKED is used by special MPI pack operations.

Table 5.1 Predefined data types for MPI

MPI data type              C data type
MPI_CHAR                   char
MPI_SHORT                  short int
MPI_INT                    int
MPI_LONG                   long int
MPI_LONG_LONG_INT          long long int
MPI_UNSIGNED_CHAR          unsigned char
MPI_UNSIGNED_SHORT         unsigned short int
MPI_UNSIGNED               unsigned int
MPI_UNSIGNED_LONG          unsigned long int
MPI_UNSIGNED_LONG_LONG     unsigned long long int
MPI_FLOAT                  float
MPI_DOUBLE                 double
MPI_LONG_DOUBLE            long double
MPI_BYTE                   single byte value
MPI_PACKED                 special data type for packing

By using source = MPI_ANY_SOURCE, a process can receive a message from any arbitrary process. Similarly, by using tag = MPI_ANY_TAG, a process can receive a message with an arbitrary tag. In both cases, the status data structure contains the information from which process the received message has been sent and which tag has been used by the sender. After completion of MPI_Recv(), status contains the following information:

• status.MPI_SOURCE specifies the rank of the sending process;
• status.MPI_TAG specifies the tag of the message received;
• status.MPI_ERROR contains an error code.

The status data structure also contains information about the length of the message received. This can be obtained by calling the MPI function MPI_Get_count().
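Putting the pieces together, a minimal example program (a sketch, not taken from the book) in which process 0 sends ten integers to process 1 could look as follows. MPI_Init, MPI_Finalize, MPI_Comm_rank, and MPI_Get_count are standard MPI routines not described in the excerpt above; the program has to be started with at least two processes.

#include <stdio.h>
#include <mpi.h>

/* Sketch: process 0 sends 10 integers to process 1, which receives them
   and inspects the status information of the received message. */
int main(int argc, char *argv[])
{
    int rank, data[10], i, count;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (i = 0; i < 10; i++) data[i] = i;
        MPI_Send(data, 10, MPI_INT, 1, 42, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(data, 10, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_INT, &count);
        printf("received %d entries from process %d with tag %d\n",
               count, status.MPI_SOURCE, status.MPI_TAG);
    }

    MPI_Finalize();
    return 0;
}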
