Parallel Programming: for Multicore and Cluster Systems - P27


be used if the executing process will change the value of window entries using MPI_Put() and if these entries could also be accessed by other processes.

A shared lock is indicated by lock_type = MPI_LOCK_SHARED. This lock type guarantees that the following RMA operations of the calling process are protected from exclusive RMA operations of other processes, i.e., other processes are not allowed to change entries of the window via RMA operations that are protected by an exclusive lock. But other processes are allowed to perform RMA operations on the same window that are also protected by a shared lock. Shared locks should be used if the executing process accesses window entries only by MPI_Get() or MPI_Accumulate().

When a process wants to read or manipulate entries of its local window using local operations, it must protect these local operations with a lock mechanism if these entries can also be accessed by other processes.

An access epoch started by MPI_Win_lock() for a window win can be terminated by calling the MPI function

int MPI_Win_unlock (int rank, MPI_Win win)

where rank is the rank of the target process. The call of this function blocks until all RMA operations issued by the calling process on the specified window have been completed both at the calling process and at the target process. This guarantees that all manipulations of window entries issued by the calling process have taken effect at the target process.
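
As an illustration of these synchronization operations, the following minimal sketch shows a shared access epoch in which a process only reads remote window entries with MPI_Get(); the names win, neighbor, buf, and count are assumed placeholders, not identifiers from the examples of this chapter.

  /* Shared access epoch: read count integers from the window of process
     neighbor, starting at displacement 0. A shared lock is sufficient,
     since the epoch contains only MPI_Get() operations. */
  MPI_Win_lock(MPI_LOCK_SHARED, neighbor, 0, win);
  MPI_Get(buf, count, MPI_INT, neighbor, 0, count, MPI_INT, win);
  MPI_Win_unlock(neighbor, win);  /* blocks until the transfer has completed */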

Example The use of lock synchronization for the iterative computation of a distributed data structure is illustrated in the following example, which is a variation of the previous examples. Here, an exclusive lock is used to protect the RMA operations:

while (!converged(A)) {
  update(A);
  update_buffer(A, from_buf);
  for (i=0; i<num_neighbors; i++) {
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, neighbor[i], 0, win);
    MPI_Put(&from_buf[i], size[i], MPI_INT, neighbor[i], to_disp[i],
            size[i], MPI_INT, win);
    MPI_Win_unlock(neighbor[i], win);
  }
}

5.5 Exercises for Chap. 5

Exercise 5.1 Consider the following incomplete piece of an MPI program:


int rank, p, size=8;
int left, right;
char send_buffer1[8], recv_buffer1[8];
char send_buffer2[8], recv_buffer2[8];

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &p);
left = (rank-1+p) % p;
right = (rank+1) % p;
MPI_Send(send_buffer1, size, MPI_CHAR, left, ...);
MPI_Recv(recv_buffer1, size, MPI_CHAR, right, ...);
MPI_Send(send_buffer2, size, MPI_CHAR, right, ...);
MPI_Recv(recv_buffer2, size, MPI_CHAR, left, ...);

(a) In the program, the processors are arranged in a logical ring and each processor should exchange its name with its neighbor to the left and its neighbor to the right. Assign a unique name to each MPI process and fill out the missing pieces of the program such that each process prints its own name as well as its neighbors' names.

(b) In the given program piece, the MPI_Send() and MPI_Recv() operations are arranged such that, depending on the implementation, a deadlock can occur. Describe how a deadlock may occur.

(c) Change the program such that no deadlock is possible by arranging the order of the MPI_Send() and MPI_Recv() operations appropriately.

(d) Change the program such that MPI_Sendrecv() is used to avoid deadlocks.

(e) Change the program such that MPI_Isend() and MPI_Irecv() are used.

Exercise 5.2 Consider the MPI program in Fig. 5.3 for the collection of distributed data blocks with point-to-point messages. The program assumes that all data blocks have the same size blocksize. Generalize the program such that each process can contribute a data block of a size different from the data blocks of the other processes. To do so, assume that each process has a local variable which specifies the size of its data block.

(Hint: First make the size of each data block available to each process in a pre-collection phase with a similar communication pattern as in Fig. 5.3 and then perform the actual collection of the data blocks.)

Exercise 5.3 Modify the program from the previous exercise for the collection of data blocks of different sizes such that no pre-collection phase is used. Instead, use MPI_Get_count() to determine the size of the data block received in each step. Compare the resulting execution time with the execution time of the program from the previous exercise for different data block sizes and different numbers of processors. Which of the programs is faster?
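
To clarify how the size of a received block can be determined without a pre-collection phase, the following minimal sketch shows the MPI_Get_count() call; the buffer name recv_buf, the constant MAXBLOCK, and the tag are assumed placeholders.

  MPI_Status status;
  int recv_size;
  /* Receive a block of unknown size (at most MAXBLOCK elements). */
  MPI_Recv(recv_buf, MAXBLOCK, MPI_INT, MPI_ANY_SOURCE, tag, MPI_COMM_WORLD, &status);
  /* Determine how many MPI_INT elements were actually received. */
  MPI_Get_count(&status, MPI_INT, &recv_size);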

Exercise 5.4 Consider the program Gather_ring() from Fig. 5.3. As described in the text, this program does not avoid deadlocks if the runtime system does not use internal system buffers. Change the program such that deadlocks are avoided in any case by arranging the order of the MPI_Send() and MPI_Recv() operations appropriately.

Exercise 5.5 The program in Fig. 5.3 arranges the processors logically in a ring to perform the collection. Modify the program such that the processors are logically arranged in a two-dimensional torus network. For simplicity, assume that all data blocks have the same size. Develop a mechanism with which each processor can determine its predecessor and successor in x and y directions. Perform the collection of the data blocks in two phases, the first phase with communication in x direction, the second phase with communication in y direction.

In both directions, communication in different rows or columns of the processor torus can be performed concurrently. For the communication in y direction, each process distributes all blocks that it has collected in the x-direction phase. Use the normal blocking send and receive operations for the communication. Compare the resulting execution time with the execution time of the ring implementation from Fig. 5.3 for different data block sizes and different numbers of processors. Which of the programs is faster?

Exercise 5.6 Modify the program from the previous exercise such that non-blocking communication operations are used.

Exercise 5.7 Consider the parallel computation of a matrix–vector multiplication A · b using a distribution of the scalar products based on a rowwise distribution of A; see Fig. 3.10, p. 127 for a sketch of a parallel pseudo program. Transform this program into a running MPI program. Select the MPI communication operations for the multi-broadcast operations appropriately.

Exercise 5.8 Similar to the preceding exercise, consider a matrix–vector multiplication using a distribution of the linear combinations based on a columnwise distribution of the matrix. Transform the pseudo program from Fig. 3.12, p. 129 to a running MPI program. Use appropriate MPI operations for the single-accumulation and single-broadcast operations. Compare the execution time with the execution time of the MPI program from the preceding exercise for different sizes of the matrix.

Exercise 5.9 For a broadcast operation a root process sends the same data block to all other processes. Implement a broadcast operation by using point-to-point send and receive operations (MPI_Send() and MPI_Recv()) such that the same effect as MPI_Bcast() is obtained. For the processes, use a logical ring arrangement similar to Fig. 5.3.

Exercise 5.10 Modify the program from the previous exercise such that two other logical arrangements are used for the processes: a two-dimensional mesh and a three-dimensional hypercube. Measure the execution time of the three different versions (ring, mesh, hypercube) for eight processors for different sizes of the data block and make a comparison by drawing a diagram. Use MPI_Wtime() for the timing.

Exercise 5.11 Consider the construction of conflict-free spanning trees in a d-dimensional hypercube network for the implementation of a multi-broadcast operation; see Sect. 4.3.2, p. 177, and Fig. 4.6. For d = 3, d = 4, and d = 5 write an MPI program with 8, 16, and 32 processes, respectively, that uses these spanning trees for a multi-broadcast operation.

(a) Implement the multi-broadcast by concurrent single-to-single transfers along the spanning trees and measure the resulting execution time for different message sizes.

(b) Implement the multi-broadcast by using multiple broadcast operations where each broadcast operation is implemented by single-to-single transfers along the usual spanning trees for hypercube networks as defined on p. 174, see Fig. 4.4. These spanning trees do not avoid conflicts in the network. Measure the resulting execution times for different message sizes and compare them with the execution times from (a).

(c) Compare the execution times from (a) and (b) with the execution time of an MPI_Allgather() operation to perform the same communication.

Exercise 5.12 For a global exchange operation, each process provides a potentially different block of data for each other process, see pp. 122 and 225 for a detailed explanation. Implement a global exchange operation by using point-to-point send and receive operations (MPI_Send() and MPI_Recv()) such that the same effect as MPI_Alltoall() is obtained. For the processes, use a logical ring arrangement similar to Fig. 5.3.

Exercise 5.13 Modify the program Gather_ring() from Fig. 5.3 such that synchronous send operations (MPI_Ssend()) are used instead of the standard MPI_Send() operations. Compare the resulting execution time with the execution time obtained for the standard send and receive operations from Fig. 5.3.

Exercise 5.14 Repeat the previous exercise with buffered send operations.

Exercise 5.15 Modify the program Gather_ring() from Fig. 5.3 such that the MPI operation MPI_Test() is used instead of MPI_Wait(). When a non-blocking receive operation is found by MPI_Test() to be completed, the process sends the received data block to the next process.
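
As a minimal sketch of the polling pattern named in this exercise (request, flag, and status are assumed placeholder names):

  int flag = 0;
  MPI_Status status;
  /* Poll the non-blocking receive until it has completed;
     other work could be overlapped inside the loop. */
  while (!flag)
    MPI_Test(&request, &flag, &status);
  /* Here the received data block can be forwarded to the next process. */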

Exercise 5.16 Write an MPI program which implements a broadcast operation with MPI_Send() and MPI_Recv() operations. The program should use n = 2^k processes which should logically be arranged as a hypercube network. Based on this arrangement the program should define a spanning tree in the network with root 0, see Fig. 3.8 and p. 123, and should use this spanning tree to transfer a message stepwise from the root along the tree edges up to the leaves. Each node in the tree receives the message from its parent node and forwards it to its child nodes. Measure the resulting runtime for different message sizes up to 1 MB for different numbers of processors using MPI_Wtime() and compare the execution times with the execution times of MPI_Bcast() performing the same operation.

Exercise 5.17 The execution time of point-to-point communication operations between two processors can normally be described by a linear function of the form

t_s2s(m) = τ + t_c · m,

where m is the size of the message; τ is a startup time, which is independent of the message size; and t_c is the inverse of the network bandwidth. Verify this function by measuring the time for a ping-pong message transmission where process A sends a message to process B, and B sends the same message back to A. Use different message sizes and draw a diagram which shows the dependence of the communication time on the message size. Determine the size of τ and t_c on your parallel computer.
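
A minimal sketch of such a ping-pong measurement is given below; the message size m (in bytes), the repetition count REPS, and the buffer buf are illustrative assumptions, not values from the text.

  /* Ping-pong between ranks 0 and 1; the one-way time for size m is the
     total time divided by 2*REPS round-trip halves. */
  double start = MPI_Wtime();
  for (int r = 0; r < REPS; r++) {
    if (rank == 0) {
      MPI_Send(buf, m, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
      MPI_Recv(buf, m, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
      MPI_Recv(buf, m, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      MPI_Send(buf, m, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }
  }
  double t_s2s = (MPI_Wtime() - start) / (2.0 * REPS);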

Exercise 5.18 Write an MPI program which arranges 24 processes in a (periodic) Cartesian grid structure of dimension 2×3×4 using MPI_Cart_create(). Each process should determine and print the process rank of its two neighbors in x, y, and z directions.

For each of the three sub-grids in y-direction, a communicator should be defined. This communicator should then be used to determine the maximum rank of the processes in the sub-grid by using an appropriate MPI_Reduce() operation. This maximum rank should be printed out.
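
As a hedged starting point for the grid construction, the following sketch shows one possible use of MPI_Cart_create() and MPI_Cart_shift(); the variable names are assumptions, and the sub-grid and reduction parts are left to the exercise.

  int dims[3]    = {2, 3, 4};
  int periods[3] = {1, 1, 1};     /* periodic in all three dimensions */
  MPI_Comm grid_comm;
  MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 0, &grid_comm);

  int left, right;
  /* Neighbors in x direction (dimension 0); dimensions 1 and 2 analogously. */
  MPI_Cart_shift(grid_comm, 0, 1, &left, &right);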

Exercise 5.19 Write an MPI program which arranges the MPI processes in a two-dimensional torus of size √p × √p where p is the number of processes. Each process exchanges its rank with its two neighbors in x and y dimensions. For the exchange, one-sided communication operations should be used. Implement three different schemes for the exchange with the following one-sided communication operations:

(a) global synchronization with MPI_Win_fence();

(b) loose synchronization by using MPI_Win_start(), MPI_Win_post(), MPI_Win_complete(), and MPI_Win_wait();

(c) lock synchronization with MPI_Win_lock() and MPI_Win_unlock().

Test your program for p = 16 processors, i.e., for a 4 × 4 torus network.


6 Thread Programming

Several parallel computing platforms, in particular multicore platforms, offer a shared address space. A natural programming model for these architectures is a thread model in which all threads have access to shared variables. These shared variables are then used for information and data exchange. To coordinate the access to shared variables, synchronization mechanisms have to be used to avoid race conditions in case of concurrent accesses. Basic synchronization mechanisms are lock synchronization and condition synchronization, see Sect. 3.7 for an overview.

In this chapter, we consider thread programming in more detail. In particular, we have a closer look at synchronization problems like deadlocks or priority inversion that might occur and present programming techniques to avoid such problems. Moreover, we show how basic synchronization mechanisms like lock synchronization or condition synchronization can be used to build more complex synchronization mechanisms like read/write locks. We also present a set of parallel patterns like task-based or pipelined processing that can be used to structure a parallel application. These issues are considered in the context of popular programming environments for thread-based programming to directly show the usage of the mechanisms in practice. The programming environments Pthreads, Java threads, and OpenMP are introduced in detail. For Java, we also give an overview of the package java.util.concurrent which provides many advanced synchronization mechanisms as well as a task-based execution environment. The goal of the chapter is to enable the reader to develop correct and efficient thread programs that can be used, for example, on multicore architectures.

6.1 Programming with Pthreads

POSIX threads (also called Pthreads) define a standard for programming with threads, based on the programming language C. The threads of a process share a common address space. Thus, the global variables and dynamically generated data objects can be accessed by all threads of a process. In addition, each thread has a separate runtime stack which is used to control the functions activated and to store their local variables. These variables declared locally within the functions are local data of the executing thread and cannot be accessed directly by other threads. Since the runtime stack of a thread is deleted after the thread is terminated, it is dangerous to pass a reference to a local variable in the runtime stack of a thread A to another thread B.

The data types, interface definitions, and macros of Pthreads are usually available via the header file <pthread.h>. This header file must therefore be included into a Pthreads program. The functions and data types of Pthreads are defined according to a naming convention, in which Pthreads functions are named in the form

pthread[_<object>]_<operation> (),

where <operation> describes the operation to be performed and the optional <object> describes the object to which this operation is applied. For example, pthread_mutex_init() is a function for the initialization of a mutex variable; thus, the <object> is mutex and the <operation> is init; we give a more detailed description later. For functions which are involved in the manipulation of threads, the specification of <object> is omitted. For example, the function for the generation of a thread is pthread_create(). All Pthreads functions yield a return value 0 if they are executed without failure. In case of a failure, an error code from <errno.h> is returned. Thus, this header file should also be included in the program.

Pthreads data types describe, similarly to MPI, opaque objects whose exact implementation is hidden from the programmer. Data types are named according to the syntax form

pthread_<object>_t,

where <object> specifies the specific data object. For example, a mutex variable is described by the data type pthread_mutex_t. If <object> is omitted, the data type pthread_t for threads results. The following table contains important Pthreads data types which will be described in more detail later.

Pthreads data type          Meaning
pthread_mutex_t             Mutex variable
pthread_cond_t              Condition variable
pthread_attr_t              Thread attributes object
pthread_mutexattr_t         Mutex attributes object
pthread_condattr_t          Condition variable attributes object
pthread_once_t              One-time initialization control context

For the execution of threads, we assume a two-step scheduling method according to Fig. 3.16 in Chap. 3, as this is the most general case. In this model, the programmer has to partition the program into a suitable number of user threads which can be executed concurrently with each other. The user threads are mapped by the library scheduler to system threads which are then brought to execution on the processors of the computing system by the scheduler of the operating system. The programmer cannot control the scheduler of the operating system and has only little influence on the library scheduler. Thus, the programmer cannot directly perform the mapping of the user-level threads to the processors of the computing system, e.g., by scheduling at program level. This facilitates program development, but also prevents an efficient mapping directly by the programmer according to his specific needs. It should be noted that there are operating system-specific extensions that allow thread execution to be bound to specific processors. But in most cases, the scheduling provided by the library and the operating system leads to good results and relieves the programmer from additional programming effort, thus providing more benefits than drawbacks.

In this section, we give an overview of programming with Pthreads. Section 6.1.1 describes thread generation and management in Pthreads. Section 6.1.2 describes the lock mechanism for the synchronization of threads accessing shared variables. Sections 6.1.3 and 6.1.4 introduce Pthreads condition variables and an extended lock mechanism using condition variables, respectively. Sections 6.1.6, 6.1.7, and 6.1.8 describe the use of the basic synchronization techniques in the context of more advanced synchronization patterns, like task pools, pipelining, and client–server coordination. Section 6.1.9 discusses additional mechanisms for the control of threads, including scheduling strategies. We describe in Sect. 6.1.10 how the programmer can influence the scheduling controlled by the library. The phenomenon of priority inversion is then explained in Sect. 6.1.11 and finally thread-specific data is considered in Sect. 6.1.12. Only the most important mechanisms of the Pthreads standard are described; for a more detailed description, we refer to [25, 105, 117, 126, 143].

6.1.1 Creating and Merging Threads

When a Pthreads program is started, a single main thread is active, executing the main() function of the program. The main thread can generate more threads by calling the function

int pthread_create (pthread_t *thread,
                    const pthread_attr_t *attr,
                    void *(*start_routine)(void *),
                    void *arg)

The first argument is a pointer to an object of type pthread_t which is also referred to as thread identifier (TID); this TID is generated by pthread_create() and can later be used by other Pthreads functions to identify the generated thread. The second argument is a pointer to a previously allocated and initialized attribute object of type pthread_attr_t, defining the desired attributes of the generated thread. The argument value NULL causes the generation of a thread with default attributes. If different attribute values are desired, an attribute data structure has to be created and initialized before calling pthread_create(); this mechanism is described in more detail in Sect. 6.1.9. The third argument specifies the function start_routine() which will be executed by the generated thread. The specified function should expect a single argument of type void * and should have a return value of the same type. The fourth argument is a pointer to the argument value with which the thread function start_routine() will be executed.

To execute a thread function with more than one argument, all arguments must be put into a single data structure; the address of this data structure can then be specified as argument of the thread function. If several threads are started by a parent thread using the same thread function but different argument values, separate data structures should be used for each of the threads to specify the arguments. This avoids situations where argument values are overwritten too early by the parent thread before they are read by the child threads, or where different child threads manipulate the argument values in a common data structure concurrently.
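
As a minimal sketch of this recommendation (one argument structure per thread), the following complete program is given; the structure layout and the names worker and thread_arg_t are chosen for illustration and are not taken from the book.

  #include <pthread.h>
  #include <stdio.h>

  /* Hypothetical per-thread argument structure. */
  typedef struct { int id; double value; } thread_arg_t;

  void *worker(void *arg) {
    thread_arg_t *a = (thread_arg_t *) arg;
    printf("thread %d got %f\n", a->id, a->value);
    return NULL;
  }

  int main(void) {
    pthread_t tid[4];
    thread_arg_t args[4];               /* one separate structure per thread */
    for (int i = 0; i < 4; i++) {
      args[i].id = i;
      args[i].value = 0.5 * i;
      if (pthread_create(&tid[i], NULL, worker, &args[i]) != 0)
        fprintf(stderr, "thread creation failed\n");
    }
    for (int i = 0; i < 4; i++)
      pthread_join(tid[i], NULL);       /* wait for all child threads */
    return 0;
  }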

A thread can determine its own thread identifier by calling the function

pthread_t pthread_self()

This function returns the thread ID of the calling thread. To compare the thread IDs of two threads, the function

int pthread_equal (pthread_t t1, pthread_t t2)

can be used. This function returns the value 0 if t1 and t2 do not refer to the same thread. Otherwise, a non-zero value is returned. Since pthread_t is an opaque data structure, only pthread_equal() should be used to compare thread IDs. The number of threads that can be generated by a process is typically limited by the system. The Pthreads standard determines that at least 64 threads can be generated by any process. But depending on the specific system used, this limit may be larger. For most systems, the maximum number of threads that can be started can be determined by calling

maxThreads = sysconf (_SC_THREAD_THREADS_MAX)

in the program. Knowing this limit, the program can avoid starting more than maxThreads threads. If the limit is reached, a call of the pthread_create() function returns the error value EAGAIN. A thread is terminated if its thread function terminates, e.g., by calling return. A thread can terminate itself explicitly by calling the function

void pthread_exit (void *valuep)

The argument valuep specifies the value that will be returned to another thread which waits for the termination of this thread using pthread_join(). When a thread terminates its thread function, the function pthread_exit() is called implicitly, and the return value of the thread function is used as argument of this implicit call of pthread_exit(). After the call to pthread_exit(), the calling thread is terminated, and its runtime stack is freed and can be used by other threads. Therefore, the return value of the thread should not be a pointer to a local variable of the thread function or another function called by the thread function. These local variables are stored on the runtime stack and may not exist any longer after the termination of the thread. Moreover, the memory space of local variables can be reused by other threads, and it can usually not be determined when the memory space is overwritten, thereby destroying the original value of the local variable. Instead of a local variable, a global variable or a variable that has been dynamically allocated should be used.
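
A minimal sketch of this rule, assuming a result of type double; the function and variable names are chosen for illustration, and <pthread.h> and <stdlib.h> are assumed to be included:

  /* Thread function that returns a dynamically allocated result instead of
     a pointer to a variable on its own runtime stack. */
  void *compute(void *arg) {
    double *result = malloc(sizeof(double));
    *result = 42.0;              /* placeholder for the actual computation */
    pthread_exit(result);        /* equivalent to: return result; */
  }

The creating thread can later retrieve the pointer with pthread_join(), described next, and is then responsible for releasing the memory with free().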

A thread can wait for the termination of another thread by calling the function

int pthread_join (pthread_t thread, void **valuep)

The argument thread specifies the thread ID of the thread for which the calling thread waits to be terminated. The argument valuep specifies a memory address where the return value of this thread should be stored. The thread calling pthread_join() is blocked until the specified thread has terminated. Thus, pthread_join() provides a possibility for the synchronization of threads. After the thread with TID thread has terminated, its return value is stored at the specified memory address. If several threads wait for the termination of the same thread using pthread_join(), all waiting threads are blocked until the specified thread has terminated. But only one of the waiting threads successfully stores the return value. For all other waiting threads, the return value of pthread_join() is the error value ESRCH.

The runtime system of the Pthreads library allocates for each thread an internal data structure to store information and data needed to control the execution of the thread. This internal data structure is preserved by the runtime system also after the termination of the thread to ensure that another thread can later successfully access the return value of the terminated thread using pthread_join(). After the call to pthread_join(), the internal data structure of the terminated thread is released and can no longer be accessed. If there is no pthread_join() for a specific thread, its internal data structure is not released after its termination and occupies memory space until the complete process is terminated. This can be a problem for large programs with many thread creations and terminations without corresponding calls to pthread_join(). The preservation of the internal data structure of a thread after its termination can be avoided by calling the function

int pthread_detach (pthread_t thread)

This function notifies the runtime system that the internal data structure of the thread with TID thread can be detached as soon as the thread has terminated. A thread may detach itself, and any thread may detach any other thread. After a thread has been detached, it can no longer be joined with pthread_join().
