What is MPI?
MPI = Message Passing Interface
Specification of message passing libraries for
developers and users
Not a library by itself, but specifies what such a library should be
Specifies application programming interface (API) for such libraries
Many libraries implement such APIs on different
platforms – MPI libraries
Goal: provide a standard for writing message
passing programs
Portable, efficient, flexible
Language bindings: C, C++, FORTRAN
MPI-2 is partially implemented in most
libraries; a few full implementations exist
(e.g., ANL MPICH2)
MPI Evolution
Why Use MPI?
Standardization: the de facto standard for parallel
computing
Not an IEEE or ISO standard, but an "industry standard"
Has practically replaced all previous message passing libraries
User has full control (data partitioning, distribution): the user needs
to identify parallelism and implement parallel algorithms using MPI function calls
The number of CPUs in a computation is static
New tasks cannot be dynamically spawned during run time (MPI 1.1)
MPI-2 specifies dynamic process creation and management, but this is not available in most implementations.
Not necessarily a disadvantage
General assumption: one-to-one mapping of MPI
processes to processors (although not necessarily
always true)
MPI Input/Output (Parallel I/O)
General MPI Program Structure
program hello
  implicit none
  include 'mpif.h'
  integer :: my_rank, num_cpus, ierr
  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, my_rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, num_cpus, ierr)
  write(*,*) "Hello, I am process ", my_rank, " among ", num_cpus, " processes"
  call MPI_FINALIZE(ierr)
end program hello

On 4 processors:
Hello, I am process 1 among 4 processes
Hello, I am process 2 among 4 processes
Hello, I am process 0 among 4 processes
Hello, I am process 3 among 4 processes
MPI Naming Conventions
All names have MPI_ prefix
In FORTRAN:
All subroutine names upper case, last argument is return code
A few functions without return code
Initialization
Initialization: MPI_Init() initializes the MPI environment (MPI_Init_thread() if multiple threads are used)
Must be called before any other MPI routine (so put it at the
beginning of the code), except the MPI_Initialized() routine.
Can be called only once; subsequent calls are erroneous.
MPI_Initialized() checks whether MPI_Init() has been called
In C:
int main(int argc, char **argv) { MPI_Init(&argc, &argv); … MPI_Finalize(); }
In FORTRAN:
program test
  call MPI_INIT(ierr)
  …
  call MPI_FINALIZE(ierr)
end program test
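A small, hedged C fragment showing how MPI_Initialized() can guard against calling MPI_Init() twice (argc and argv are assumed to come from main):

int flag = 0;
MPI_Initialized(&flag);            /* flag is nonzero if MPI_Init has already been called */
if (!flag)
    MPI_Init(&argc, &argv);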
Termination
MPI_Finalize() cleans up MPI environment
Must be called before the program exits
No other MPI routine can be called after this call,
even MPI_INIT()
Exception: MPI_Initialized() (and
MPI_Get_version(), MPI_Finalized())
Abnormal termination: MPI_Abort()
Makes a best attempt to abort all tasks
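A one-line, hedged sketch of abnormal termination (the error condition is hypothetical):

if (fatal_error)                      /* hypothetical error condition */
    MPI_Abort(MPI_COMM_WORLD, 1);     /* best attempt to abort all tasks in the communicator */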
MPI Processes
MPI is process-oriented: program consists of multiple
processes, each corresponding to one processor
MIMD: each process runs its own code. In practice, each runs its own copy of the same code (SPMD)
MPI process and threads: MPI process can contain a
single thread (common case) or multiple threads
Most MPI implementations do not support multiple threads
Implementations that do support multiple threads require special handling.
We will assume a single thread per process from now on.
MPI processes are identified by their ranks:
If there are nprocs processes in total, ranks range over 0,
1, …, nprocs-1 (in both C and FORTRAN).
nprocs does not change during computation.
Communicators and Process Groups
Communicator: a group of processes that can
communicate with one another
Most MPI routines require a communicator argument to specify the collection of processes the communication is based on
All processes in the computation form the communicator MPI_COMM_WORLD
MPI_COMM_WORLD is pre-defined by MPI, available anywhere
Can create subgroups/subcommunicators within
MPI_COMM_WORLD
A process may belong to different communicators, and have
different ranks in different communicators.
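A hedged C sketch of creating a sub-communicator with MPI_Comm_split (the even/odd split is purely illustrative):

MPI_Comm subcomm;
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
/* processes passing the same color value end up in the same new communicator */
MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &subcomm);
/* this process now belongs to both MPI_COMM_WORLD and subcomm,
   possibly with a different rank in each */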
How many CPUs, Which one am I …
How many CPUs: MPI_COMM_SIZE()
Who am I: MPI_COMM_RANK()
Can compute data decomposition, etc.
Knowing the total number of grid points, the total number of CPUs, and
the current CPU id, one can calculate which portion of data the current
CPU is to work on.
E.g., the Poisson equation on a square domain
Ranks are also used to specify the source and destination of messages
MPI_COMM_SIZE(comm, size, ierr)
MPI_COMM_RANK(comm, rank, ierr)
The my_rank value is different on different processes!
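A minimal C sketch (not from the slides) of a 1-D block decomposition: given the rank and size, compute which index range this process works on (the value of npoints is illustrative):

int my_rank, ncpus;
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &ncpus);
int npoints = 1000;                                     /* total number of grid points (illustrative) */
int chunk = npoints / ncpus;                            /* base number of points per process */
int rem   = npoints % ncpus;                            /* the first 'rem' ranks get one extra point */
int istart = my_rank * chunk + (my_rank < rem ? my_rank : rem);
int iend   = istart + chunk + (my_rank < rem ? 1 : 0);  /* this rank works on indices [istart, iend) */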
Compiling, Running
MPI standard does not specify how to start up the
program
Compiling and running MPI code is implementation dependent
MPI implementations provide utilities/commands for
compiling/running MPI codes
Compile: mpicc, mpiCC, mpif77, mpif90, mpCC, mpxlf …
mpiCC -o myprog myfile.C (cluster)
mpif90 -o myprog myfile.f90 (cluster)
CC -Ipath_mpi_include -o myprog myfile.C -lmpi (SGI)
mpCC -o myprog myfile.C (IBM)
Run: mpirun, poe, prun, ibrun …
mpirun -np 2 myprog (cluster)
mpiexec -np 2 myprog (cluster)
poe myprog -node 1 -tasks_per_node 2 … (IBM)
6 MPI functions:
MPI_Init()
MPI_Finalize()
MPI_Comm_rank()
MPI_Comm_size()
MPI_Send()
MPI_Recv()
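A minimal, hedged C sketch (not from the slides) that uses only these six functions: rank 0 sends one integer to rank 1:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0 && size > 1) {
        value = 42;                                          /* illustrative payload */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("Process 1 received %d from process 0\n", value);
    }

    MPI_Finalize();
    return 0;
}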
MPI Communications
Point-to-point communication:
Involves a sender and a receiver, one process communicating with another process
Only the two processes participate in the communication
Collective communication:
All processes within a communicator participate (by calling the same routine, possibly with different arguments)
Examples: barrier, reduction operations, gather, …
Point-to-Point Communications
Send / Receive
Message data: what to send/receive?
Where is the message? Where to put it?
What kind of data is it? What is the size?
Message envelope: where to send/receive?
…
Send
buf – memory address of start of message
count – number of data items
datatype – what type each data item is (integer,
character, double, float …)
dest – rank of receiving process
tag – additional identification of message
comm – communicator, usually MPI_COMM_WORLD
int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
MPI_SEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
<type> BUF(*)
integer COUNT, DATATYPE, DEST, TAG, COMM, IERROR
char message[256];
strcpy(message, "hello");   /* fill the buffer before sending (contents illustrative) */
MPI_Send(message, strlen(message)+1, MPI_CHAR, 1, 99, MPI_COMM_WORLD);   /* to rank 1, tag 99 */
Receive
buf – initial address of receive buffer
count – number of elements in receive buffer (size of receive buffer)
may not equal the count of items actually received
Actual number of data items received can be obtained by calling
MPI_Get_count().
datatype – data type in receive buffer
source – rank of sending process
tag – additional identification for message
comm – communicator, usually MPI_COMM_WORLD
status – object containing additional info of received message
ierror – return code
int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm, MPI_Status *status)
MPI_RECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS, IERROR)
<type> BUF(*)
integer COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS(MPI_STATUS_SIZE), IERROR
The actual number of data items received can be queried from the status object; it may be
smaller than count, but cannot be larger (if larger, it is an overflow error).
char message[256];
MPI_Status status;
MPI_Recv(message, 256, MPI_CHAR, 0, 99, MPI_COMM_WORLD, &status);   /* from rank 0, tag 99 */
MPI_Recv Status
In C: MPI_Status structure, 3 members; MPI_Status status
status.MPI_TAG – tag of received message
status.MPI_SOURCE – source rank of message
status.MPI_ERROR – error code
In FORTRAN: integer array; integer status(MPI_STATUS_SIZE)
status(MPI_TAG) – tag of received message
status(MPI_SOURCE) – source rank of message
status(MPI_ERROR) – error code
Length of received message: MPI_Get_count()
int MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count)
MPI_GET_COUNT(STATUS, DATATYPE, COUNT, IERROR)
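A short, hedged C fragment (not from the slides) showing how the status object and MPI_Get_count() are typically used after a wildcard receive:

double buf[100];
MPI_Status status;
int nreceived;

/* receive up to 100 doubles from any source, with any tag */
MPI_Recv(buf, 100, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

MPI_Get_count(&status, MPI_DOUBLE, &nreceived);   /* how many items actually arrived */
printf("got %d doubles from rank %d with tag %d\n",
       nreceived, status.MPI_SOURCE, status.MPI_TAG);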
Message Data
A message consists of count successive entries of the type indicated by datatype, starting with the entry at the address buf.
MPI data types:
Basic data types: one for each data type in
hosting languages of C/C++, FORTRAN
Derived data types: covered later.
Basic MPI Data Types
MPI datatype            FORTRAN datatype
MPI_INTEGER             INTEGER
MPI_REAL                REAL
MPI_DOUBLE_PRECISION    DOUBLE PRECISION
MPI_COMPLEX             COMPLEX
MPI_LOGICAL             LOGICAL
MPI_CHARACTER           CHARACTER(1)
MPI_BYTE
MPI_PACKED
MPI datatype            C datatype
MPI_CHAR                signed char
MPI_SHORT               signed short
MPI_INT                 signed int
MPI_LONG                signed long
MPI_UNSIGNED_CHAR       unsigned char
MPI_UNSIGNED_SHORT      unsigned short
MPI_UNSIGNED            unsigned int
MPI_UNSIGNED_LONG       unsigned long int
Message Envelope
Source: can be a wildcard, MPI_ANY_SOURCE (for MPI_Recv only).
Source can also be MPI_PROC_NULL: the call returns as soon as possible with no effect, and the receive buffer is not modified.
Tag: a non-negative number 0, 1, …, UB; UB can be
determined by querying the MPI environment (UB >= 32767)
For MPI_Recv, can be a wildcard, MPI_ANY_TAG.
Communicator: specified, usually MPI_COMM_WORLD
In Order for a Message To be Received …
Message envelopes must match
Message must be directed to the process calling
MPI_Recv
Message source must match that specified by
MPI_Recv, unless MPI_ANY_SOURCE is specified
Message tag must match that specified by MPI_Recv, unless MPI_ANY_TAG is specified
Message communicator must match that specified by MPI_Recv
Data type must match
Datatype specified by MPI_Send and MPI_Recv
must match
(MPI_PACKED can match any other data type.)
Can be more complicated when derived data types are involved
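For context, a hedged sketch of the exchange the receive variants below would be matched against (the send count of 10 doubles and the tag value are assumptions, not taken from the original slide):

/* on process 0: send 10 doubles with tag 'tag' */
double A[10];
int tag = 100;                 /* illustrative tag value */
/* ... fill A with data ... */
MPI_Send(A, 10, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);

/* on process 1: candidate receives */
double B[15];
MPI_Status status;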
MPI_Recv(B, 15, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, &status);                 // ok
// MPI_Recv(B, 15, MPI_FLOAT, 0, tag, MPI_COMM_WORLD, &status);               // wrong: datatype does not match
// MPI_Recv(B, 15, MPI_DOUBLE, 0, tag1, MPI_COMM_WORLD, &status);             // does not match (different tag)
// MPI_Recv(B, 15, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD, &status);              // does not match (different source)
// MPI_Recv(B, 15, MPI_DOUBLE, MPI_ANY_SOURCE, tag, MPI_COMM_WORLD, &status); // ok
// MPI_Recv(B, 15, MPI_DOUBLE, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);      // ok
MPI_Recv is blocking: it returns only after the receive
buffer contains the received message
After it returns, the data is there and ready for use.
Non-blocking send/recv: will be discussed later.
Non-blocking calls return immediately; however, it is not safe to access
the send/receive buffers yet. Other functions must be called to complete the
send/recv; only then is it safe to access/modify the send/receive buffers.
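A hedged preview of the non-blocking pattern (covered in detail later; buffer size, source, and tag are illustrative):

MPI_Request request;
MPI_Status  status;
double buf[100];

MPI_Irecv(buf, 100, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD, &request);  /* returns immediately */
/* ... do other work; do not touch buf yet ... */
MPI_Wait(&request, &status);     /* completes the receive; buf is now safe to use */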
Buffering
Send and matching receive operations may not
be (and in reality are not) synchronized. The MPI
implementation must decide what happens when send/recv are out of sync.
Consider:
A send occurs 5 seconds before the receive is ready;
where is the message while the receive is pending?
Multiple sends arrive at the same receiving task, which can receive only one send at a time; what happens to the messages that are backing up?
The MPI implementation (not the MPI standard)
decides what happens in these cases. Typically
a system buffer is used to hold data in transit.
Buffering
System buffer:
Invisible to users and managed by the MPI library
A finite resource that can be easily exhausted
May exist on the sending side, the receiving side, or both
May improve performance
Users can attach their own buffer for MPI message buffering.
Communication Modes for Send
Standard mode: MPI_Send
System decides whether the outgoing message will be buffered
or not
Usually, small messages are buffered; large messages are not buffered and are sent in synchronous mode.
Buffered mode: MPI_Bsend
Message will be copied to buffer; Send call then returns
User can attach own buffer for use
Synchronous mode: MPI_Ssend
No buffering.
Will block until a matching receive starts receiving data
Ready mode: MPI_Rsend
Can be used only if a matching receive has already been posted (avoids the handshake);
otherwise erroneous.
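A hedged C sketch of buffered mode with a user-attached buffer (the buffer size is illustrative; data, count, dest, and tag are assumed to be defined elsewhere, and malloc comes from stdlib.h):

int bufsize = 100000 + MPI_BSEND_OVERHEAD;     /* illustrative size; must cover all outstanding Bsends */
char *userbuf = (char *) malloc(bufsize);
MPI_Buffer_attach(userbuf, bufsize);           /* MPI may now copy outgoing messages into userbuf */
MPI_Bsend(data, count, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);   /* returns once the message is copied */
MPI_Buffer_detach(&userbuf, &bufsize);         /* blocks until buffered messages have been delivered */
free(userbuf);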
Properties
Order: MPI messages are non-overtaking
If a sender sends two messages in succession to same
destination, and both match the same receive, then this receive will receive the first message no matter which message
physically arrives at the receiving end first.
If a receiver posts two receives in succession and both match the same message, then the first receive will be satisfied.
Note: if a receive matches two messages from two
different senders, the receive may receive either one
Properties
Progress: if a pair of matching send/recv are
initiated on two processes, at least one of them will complete
Send will complete, unless the receive is satisfied by some other message
Receive will complete, unless send is consumed by some other matching receive
Properties
Fairness: no guarantee of fairness
If a message is sent to a destination, and the destination process repeatedly posts a
receive that matches this send, the message may nevertheless never be received,
because it is each time overtaken by another message sent from another source.
It is the user's responsibility to prevent
starvation in such situations.
An MPI_Send that cannot complete due to lack
of buffer space will only block, waiting for
buffer space to become available or for a matching receive.
Deadlock
Deadlock is a state in which the program cannot proceed
Cyclic dependencies cause deadlock
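A hedged sketch (not from the original slides) of a cyclic dependency: each process posts a blocking synchronous send first, so each waits for a receive the other never reaches (rank, a, b, N, and status are assumed to be declared):

if (rank == 0) {
    MPI_Ssend(a, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   /* blocks until rank 1 starts receiving */
    MPI_Recv (b, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &status);
} else if (rank == 1) {
    MPI_Ssend(a, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);   /* blocks until rank 0 starts receiving */
    MPI_Recv (b, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
}
/* neither MPI_Ssend can complete, so neither MPI_Recv is ever posted: deadlock */

Reordering the calls on one process, using non-blocking calls, or using MPI_Sendrecv (next) breaks the cycle.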
Send-receive
Two remedies: non-blocking communication, send-recv
MPI_SENDRECV: combine send and recv in one call
Useful in shift operations; avoids possible deadlock in circular shift and similar operations
Equivalent to: execute a nonblocking send and a nonblocking recv, and then wait for them to complete
int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype sendtype,
int dest, int sendtag,
void *recvbuf, int recvcount, MPI_Datatype recvtype,
int source, int recvtag,
MPI_Comm comm, MPI_Status *status)
MPI_SENDRECV(SENDBUF, SENDCOUNT, SENDTYPE, DEST, SENDTAG,
RECVBUF, RECVCOUNT, RECVTYPE, SOURCE, RECVTAG, COMM, STATUS, IERROR)
<type> SENDBUF(*), RECVBUF(*)
INTEGER SENDCOUNT, SENDTYPE, SENDTAG, DEST, RECVCOUNT, RECVTYPE, SOURCE,
RECVTAG, COMM, IERROR, STATUS(MPI_STATUS_SIZE)
The send and receive buffers must not overlap (sendbuf and recvbuf may not be the same memory address); MPI_Sendrecv_replace uses a single buffer for both.
int my_rank, ncpus, data_received;
int left_neighbor, right_neighbor;
int send_tag = 101, recv_tag = 101;   /* tags must match across ranks */
MPI_Status status;
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &ncpus);
left_neighbor  = (my_rank - 1 + ncpus) % ncpus;   /* circular shift: wrap around */
right_neighbor = (my_rank + 1) % ncpus;
MPI_Sendrecv(&my_rank, 1, MPI_INT, left_neighbor, send_tag,
             &data_received, 1, MPI_INT, right_neighbor, recv_tag,
             MPI_COMM_WORLD, &status);
printf("Among %d processes, process %d received from right neighbor: %d\n",
       ncpus, my_rank, data_received);