[Beowulf] Help with inconsistent network performance
Brendan Moloney moloney.brendan at gmail.com
Tue Dec 18 20:14:49 PST 2007
Ok guys, thanks for all the feedback. I guess I should have provided some more specific details.

I am using sockets with TCP/IP for the final gather stage. I am doing real-time (volume) rendering. The images are 32-bit (RGBA with 8 bits per channel). The machines are running the 2.6 kernel, and I have confirmed that the max TCP send/recv buffer sizes are 4 MB (more than enough to store the full 512x512 image).

I wrote two simple test programs to make sure that the problem was not something else in my rather complex rendering pipeline (memory allocation etc.). The server-side test program launches N nodes using mpich2, each of which establishes a connection to the view client with a socket over TCP/IP. In each loop iteration the client-side program sends a single integer to rank 0, rank 0 broadcasts this integer to the other nodes, and then all nodes send back 1 MB / N of data.

To make sure there was not an issue with the MPI broadcast, I did one test run with 5 nodes sending back only 4 bytes of data each. The result was an RTT of less than 0.3 ms.

Next I did a run with one node sending 1 MB back to the client; the result was an RTT of less than 12 ms. Letting the test run in a loop, I saw that the first ~100 packets were a bit slower (~16 ms) and then not a single packet took longer than 14 ms. So the performance was very consistent, as expected for a single node.

Then I did a run with two nodes sending back 1/2 MB each; the result was an RTT of ~16 ms on frames without a hiccup. About 0.2% of the frames were hiccups. On a run with 3 nodes sending back 1/3 MB each I got an RTT of ~19-20 ms, and again about 0.2% of the frames were hiccups. With 4 nodes sending 1/4 MB each I got an RTT of ~20-21 ms, and about 3.5% of the frames were hiccups. Finally, with 5 nodes sending 1/5 MB each I got an RTT of ~21 ms, and about 13.5% of the frames were hiccups. I could not test on more nodes as the other computers were in use by other people.

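For anyone who wants to try something similar, the test loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual test programs: the "nodes" are plain threads on the loopback interface instead of mpich2 ranks on separate machines, the client triggers every node directly instead of going through rank 0 and an MPI broadcast, and the names `measure`, `node`, `recv_exact`, and `hiccup_rate` as well as the hiccup threshold are made up for the example.

```python
import socket
import struct
import threading
import time

# One 512x512 RGBA frame with 8 bits per channel is exactly 1 MiB.
TOTAL_BYTES = 512 * 512 * 4


def recv_exact(sock, nbytes):
    """Read exactly nbytes from a TCP socket (recv may return short reads)."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = sock.recv(min(remaining, 65536))
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def node(host, port, share, frames):
    """Stand-in for one render node: wait for a trigger, send back its share."""
    payload = bytes(share)
    with socket.create_connection((host, port)) as sock:
        for _ in range(frames):
            recv_exact(sock, 4)  # the 4-byte integer trigger
            sock.sendall(payload)


def measure(num_nodes, frames):
    """Return one round-trip time in milliseconds per frame."""
    share = TOTAL_BYTES // num_nodes
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))  # assumption: loopback, OS-chosen port
        listener.listen(num_nodes)
        host, port = listener.getsockname()
        threads = [
            threading.Thread(target=node, args=(host, port, share, frames))
            for _ in range(num_nodes)
        ]
        for t in threads:
            t.start()
        conns = [listener.accept()[0] for _ in range(num_nodes)]

    rtts_ms = []
    for frame in range(frames):
        start = time.perf_counter()
        for conn in conns:  # simplification: trigger each node directly
            conn.sendall(struct.pack("!i", frame))
        for conn in conns:  # gather every node's share of the frame
            recv_exact(conn, share)
        rtts_ms.append((time.perf_counter() - start) * 1000.0)

    for conn in conns:
        conn.close()
    for t in threads:
        t.join()
    return rtts_ms


def hiccup_rate(rtts_ms, threshold_ms):
    """Fraction of frames whose RTT exceeds a caller-chosen threshold."""
    return sum(1 for r in rtts_ms if r > threshold_ms) / len(rtts_ms)


if __name__ == "__main__":
    for n in (1, 2, 5):
        rtts = measure(n, frames=200)
        # 30 ms is an arbitrary cut-off chosen for this example only.
        print(n, "nodes:", f"{100 * hiccup_rate(rtts, 30.0):.1f}% hiccups")
```

Run on loopback this will not reproduce the behaviour of a real switch, so the numbers it prints say nothing about the cluster. It is only meant to show the shape of the measurement: one small trigger out, N shares of a 1 MB frame back, and one RTT recorded per frame.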
One interesting pattern I noticed is that the hiccup frame RTTs, almost without exception, fall into one of three ranges (approximately 50-60 ms, 200-210 ms, and 250-260 ms). Could this be related to exponential back-off?

Tomorrow I will experiment with jumbo frames and flow control settings (both of which the HP ProCurve claims to support). If these do not solve the problem, I will start sifting through tcpdump output.

Thanks,
Brendan
