[Beowulf] very low performance for very small packets under MPICH (TCP_NODELAY?)

Thu Dec 29 07:25:36 PST 2005

[cross-posted to comp.parallel.mpi]

We have a Beowulf-class cluster running Fedora Core 3 Linux (kernel
2.6.15) with MPICH 1.2.7 and Gigabit Ethernet, using a 3Com switch and
3C2000-T NICs. We see very low communication efficiency for very small
packets (shorter than 16 bytes). The symptoms are the same as those
reported at http://www.icase.edu/coral/LinuxTCP.html. When we time
1000 sends of 8-byte packets, the distribution of times is very
similar to the one shown in that reference: most sends take less than
1e-4 s (close to the Ethernet+TCP latency), *BUT* roughly 1 packet in
30 takes on the order of 0.03 s. This degrades the average performance
for very small packets by a factor of 100.
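
To make the measurement concrete, here is a minimal sketch in C of how
per-send timings like these could be summarized. It assumes the times
(in seconds) have already been collected, for example with MPI_Wtime()
around each send. The struct, the function name and the threshold
argument are hypothetical, chosen only for this illustration.

```c
#include <stddef.h>

/* Summary of a set of per-send timings, all in seconds. */
struct latency_summary {
    size_t count;      /* number of samples */
    size_t slow;       /* samples at or above the threshold */
    double mean;       /* mean over all samples */
    double mean_fast;  /* mean over samples below the threshold */
};

/* Count the slow outliers and compute the mean with and without them.
 * `threshold` separates "normal" sends from stalled ones; its value is
 * an assumption the caller has to pick (e.g. 1e-3 s). */
struct latency_summary summarize_latencies(const double *t, size_t n,
                                           double threshold)
{
    struct latency_summary s = { n, 0, 0.0, 0.0 };
    double sum_all = 0.0, sum_fast = 0.0;

    for (size_t i = 0; i < n; i++) {
        sum_all += t[i];
        if (t[i] >= threshold)
            s.slow++;
        else
            sum_fast += t[i];
    }
    if (n > 0)
        s.mean = sum_all / (double)n;
    if (n > s.slow)
        s.mean_fast = sum_fast / (double)(n - s.slow);
    return s;
}
```

Comparing mean with mean_fast shows how much the few slow sends
dominate the average.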

This seems to be a well-known issue with old kernels; for 2.0.x and
2.2.x a patch is provided at the link above. I have not found any
reference to this problem for newer kernels and MPICH releases. As far
as I can tell, it was supposed to be fixed in MPICH by disabling the
Nagle algorithm through calls to `setsockopt(...TCP_NODELAY...)'.
These calls are enabled on BSD systems and apparently on some SYSV
systems such as Linux. In our case I verified that the MPICH configure
script correctly detects the system as LINUX, which in turn sets the
`CAN_DO_SETSOCKOPT' flag in the P4 code, which enables the
`setsockopt(...TCP_NODELAY...)' calls in
./mpid/ch_p4/p4/lib/p4_sock_util.c.
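
For reference, a minimal sketch in C of what disabling the Nagle
algorithm on a TCP socket looks like, plus a check that the option
actually took effect. This is not the code from p4_sock_util.c; the
two helper names are hypothetical. Only setsockopt(), getsockopt(),
IPPROTO_TCP and TCP_NODELAY are the standard sockets API.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disable the Nagle algorithm on TCP socket `fd`.
 * Returns 0 on success, -1 on error (errno set by setsockopt). */
int disable_nagle(int fd)
{
    int on = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}

/* Report whether TCP_NODELAY is currently set on `fd`.
 * Returns 1 if set (Nagle disabled), 0 if not, -1 on error. */
int nagle_disabled(int fd)
{
    int val = 0;
    socklen_t len = sizeof(val);

    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &val, &len) < 0)
        return -1;
    return val != 0;
}
```

Calling a check like nagle_disabled() on the sockets P4 actually opens
would show whether the option is really set at run time, rather than
only enabled at configure time.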

Any pointers that help explain why the Nagle algorithm is still active
on the MPI sockets, or how to disable it, would be appreciated, as
would any way to disable the Nagle algorithm at the kernel TCP level.

TIA

Mario