[Beowulf] MPI - time for packing, unpacking, creating a message...

Tue May 26 04:37:22 PDT 2009

The original question was about relatively small messages: only 500 elements each (4000 bytes if they are doubles).

You can often get better throughput if you send, say, two smaller messages rather than one large one.
This is because the interconnect can issue multiple RDMA requests that proceed concurrently.

This old paper from 2003 illustrates the point:
http://www.docstoc.com/docs/5579957/Quadrics-QsNetII-A-network-for-Supercomputing-Applications
Page 25 shows a graph in which 1, 2, 4 and 8 RDMA operations are issued concurrently. For large messages (>256KB) there is no significant difference in the total bandwidth achieved: it is limited by the PCIe/PCI-X interface or by the interconnect fabric itself.
At smaller message sizes, however, there are measurable differences, e.g. two 1K messages show higher total bandwidth than a single 2K message.

Daniel

P.S. Did you really mean to compare three 500-element transfers with a single 2000-element transfer, rather than the same total message size in both cases?

P.P.S. Case A is really a broadcast: interconnects that implement broadcast in hardware are bound to do A faster than B.

From: beowulf-bounces at beowulf.org On Behalf Of Bruno Coutinho
Sent: 23 May 2009 16:44
To: tribur at vision.ee.ethz.ch
Cc: beowulf at beowulf.org
Subject: Re: [Beowulf] MPI - time for packing, unpacking, creating a message...

If you are using Gigabit Ethernet with jumbo frames (9000 bytes for example):
A will send 3 packets of 4000 bytes each, and
B will send one packet of 9000 bytes and one of 7000 bytes.
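
The packet counts above can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes 8-byte doubles (so 500 elements are 4000 bytes and 2000 elements are 16000 bytes), treats the whole 9000-byte frame as payload, and ignores protocol headers, as the estimate above does. The function names are made up for illustration.

```c
#include <stddef.h>

/* Number of frames needed to carry `bytes` of payload when each frame
 * holds at most `mtu` bytes (headers ignored). */
size_t frames_needed(size_t bytes, size_t mtu)
{
    if (mtu == 0)
        return 0;
    return (bytes + mtu - 1) / mtu;   /* ceiling division */
}

/* Payload carried by the last frame of a message. */
size_t last_frame_bytes(size_t bytes, size_t mtu)
{
    if (mtu == 0 || bytes == 0)
        return 0;
    size_t rem = bytes % mtu;
    return rem ? rem : mtu;
}
```

With a 9000-byte frame, `frames_needed(4000, 9000)` is 1 for each of the three messages in A, while `frames_needed(16000, 9000)` is 2 for B, with `last_frame_bytes(16000, 9000)` equal to 7000.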

For the CPU, B is better, because it will generate one system call whereas A will generate three.
And since many high-speed interconnects today need large packets to fully utilize their bandwidth, I think that B should be faster.
But the only way to be sure is to test.

2009/5/18 <tribur at vision.ee.ethz.ch>
Hi all,

Is there anyone who can tell me whether A) or B) is probably faster?

A)
process 0 sends 3x500 elements, e.g. doubles, to 3 different processors using something like
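if (rank == 0) {
    /* Same buffer, three destinations; each tag equals the receiver's rank. */
    MPI_Send(sendbuf, 500, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
    MPI_Send(sendbuf, 500, MPI_DOUBLE, 2, 2, MPI_COMM_WORLD);
    MPI_Send(sendbuf, 500, MPI_DOUBLE, 3, 3, MPI_COMM_WORLD);
} else if (rank <= 3) {
    /* Only ranks 1..3 are sent a message; status is assumed to be an MPI_Status. */
    MPI_Recv(recvbuf, 500, MPI_DOUBLE, 0, rank, MPI_COMM_WORLD, &status);
}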

B)
process 0 sends 2000 elements to process 1 using
/* Only rank 1 is sent a message, so only rank 1 posts a receive. */
if (rank == 0)
    MPI_Send(sendbuf, 2000, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
else if (rank == 1)
    MPI_Recv(recvbuf, 2000, MPI_DOUBLE, 0, rank, MPI_COMM_WORLD, &status);
