[Beowulf] Help with inconsistent network performance
moloney.brendan at gmail.com
Tue Dec 18 20:14:49 PST 2007
Ok guys, thanks for all the feedback.
I guess I should have provided some more specific details. I am using
sockets with TCP/IP for the final gather stage. I am doing real-time
(volume) rendering. The images are 32-bit (RGBA with 8 bits per channel).
The machines are running the 2.6 kernel and I have confirmed that the max
TCP send/recv buffer sizes are 4 MB (more than enough to store the full
image).
I wrote two simple test programs to make sure that it was not something else
in my rather complex rendering pipeline (memory allocation etc.). The
server side test program launches N nodes using mpich2, each of which
establishes a connection to the view client with a socket over TCP/IP. Then
I loop with the client side program sending a single integer to rank 0, then
rank 0 broadcasts this integer to the other nodes, and then all nodes send
back 1MB / N of data.
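For anyone who wants to reproduce the shape of this test without a cluster, here is a minimal loopback sketch. It is not my actual test program: threads stand in for the MPI ranks, the client sends the frame number to every sender directly instead of going through rank 0 and an MPI broadcast, and the names (measure_rtts, sender, recv_exact) are made up for illustration.

```python
import socket
import struct
import threading
import time

TOTAL_BYTES = 1 << 20  # 1 MB gathered per frame, split evenly across senders


def recv_exact(sock, n):
    """Read exactly n bytes from sock, or raise if the peer closes early."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(min(65536, n - len(buf)))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def sender(port, share, frames):
    """Stand-in for one node: wait for a frame number, reply with `share` bytes."""
    with socket.create_connection(("127.0.0.1", port)) as s:
        payload = b"\0" * share
        for _ in range(frames):
            recv_exact(s, 4)   # frame request (replaces the MPI broadcast)
            s.sendall(payload)


def measure_rtts(n_senders, frames=20, total=TOTAL_BYTES):
    """Return one round-trip time in milliseconds per frame."""
    share = total // n_senders
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # let the OS pick a free port
    listener.listen(n_senders)
    port = listener.getsockname()[1]

    threads = [threading.Thread(target=sender, args=(port, share, frames))
               for _ in range(n_senders)]
    for t in threads:
        t.start()
    conns = [listener.accept()[0] for _ in range(n_senders)]
    listener.close()

    rtts = []
    for frame in range(frames):
        start = time.perf_counter()
        for c in conns:
            c.sendall(struct.pack("!i", frame))  # single integer request
        for c in conns:
            recv_exact(c, share)                 # gather 1/N of the data
        rtts.append((time.perf_counter() - start) * 1000.0)

    for c in conns:
        c.close()
    for t in threads:
        t.join()
    return rtts
```

Calling measure_rtts(5) gives a list of per-frame RTTs for five senders. On loopback there is no switch in the path, so this only checks the test logic and will not reproduce the hiccups.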
To make sure there was not an issue with the MPI broadcast, I did one test
run with 5 nodes sending back only 4 bytes of data each. The result was an
RTT of less than 0.3 ms. Next I did a run with one node sending 1 MB back to
the client; the result was an RTT of less than 12 ms. Letting the test run
in a loop, I saw that the first ~100 packets were a bit slower (~16 ms) and
after that not a single packet took longer than 14 ms. So the performance was
very consistent, as expected for a single node. Then I did a run with two
nodes sending back 1/2 MB each; the result was an RTT of ~16 ms on frames
without a hiccup, and about 0.2% of the frames were hiccups. On a run with 3
nodes sending back 1/3 MB each I got an RTT of ~19-20 ms and again about
0.2% of the frames were hiccups.
With 4 nodes sending 1/4 MB each I got an RTT of ~20-21 ms and about
3.5% of the frames were hiccups.
Finally, with 5 nodes sending 1/5 MB each I got an RTT of ~21 ms and about
13.5% of the frames were hiccups. I could not test on more nodes as the
other computers were in use by other people.
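The percentages above can be computed from a list of per-frame RTTs along these lines. This is only a sketch: the rule used here (a frame is a hiccup if it takes more than twice the median RTT) is an assumption chosen for illustration, and hiccup_rate is a hypothetical helper name.

```python
from statistics import median


def hiccup_rate(rtts_ms, factor=2.0):
    """Return (typical RTT, fraction of frames counted as hiccups).

    A frame is counted as a hiccup when its RTT exceeds `factor` times
    the median RTT. The factor of 2 is an assumed cutoff.
    """
    base = median(rtts_ms)
    hiccups = [r for r in rtts_ms if r > factor * base]
    return base, len(hiccups) / len(rtts_ms)
```

Using the median as the baseline keeps a handful of very slow frames from shifting the "normal" RTT, which a plain mean would not do.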
One interesting pattern I noticed is that the hiccup frame RTTs, almost
without exception, fall into one of three ranges (approximately 50-60 ms,
200-210 ms, and 250-260 ms). Could this be related to exponential back-off?
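One way to look at the back-off question is to bin the hiccup RTTs into the three ranges and compare them with what pure doubling would produce. The sketch below assumes the ranges are in milliseconds; the initial timeout passed to backoff_stalls is a free parameter, not a measured value, and all function names are hypothetical.

```python
from collections import Counter

# The three ranges reported above, assumed to be in milliseconds.
CLUSTERS = [(50, 60), (200, 210), (250, 260)]


def cluster_of(rtt_ms):
    """Label an RTT with the reported range it falls into, or 'other'."""
    for lo, hi in CLUSTERS:
        if lo <= rtt_ms <= hi:
            return f"{lo}-{hi} ms"
    return "other"


def histogram(rtts_ms):
    """Count how many RTTs land in each reported range."""
    return Counter(cluster_of(r) for r in rtts_ms)


def backoff_stalls(initial_ms, retries):
    """Cumulative stall after 1..retries timeouts if each timeout doubles."""
    stalls, total, timeout = [], 0, initial_ms
    for _ in range(retries):
        total += timeout
        stalls.append(total)
        timeout *= 2
    return stalls
```

With doubling alone, stalls would be expected near t, 3t, and 7t above the normal RTT for some initial timeout t. Whether the three observed ranges fit that pattern, or instead point to different timers adding up, is something the tcpdump traces should settle.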
Tomorrow I will experiment with jumbo frames and flow control settings (both
of which the HP ProCurve claims to support). If these do not solve the
problem I will start sifting through tcpdump output.