[Beowulf] Help with inconsistent network performance
moloney.brendan at gmail.com
Tue Dec 18 21:40:48 PST 2007
On 12/18/07, Mark Hahn <hahn at mcmaster.ca> wrote:
> > The machines are running the 2.6 kernel and I have confirmed that the
> > TCP send/recv buffer sizes are 4MB (more than enough to store the full
> > 512x512 image).
> the bandwidth-delay product in a lan is low enough to not need
> this kind of tuning.
I didn't actually do any tuning; I just checked that the maximum buffer size
the Linux auto-tuning can use is sufficient.
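For reference, a rough sketch of the bandwidth-delay arithmetic behind this exchange. The 1 Gb/s link speed and the 4 bytes per pixel are assumptions (neither is stated in this thread); the 0.3 ms RTT is borrowed from the small-message test described further down.

```python
def bdp_bytes(link_bits_per_s, rtt_s):
    """Bandwidth-delay product: bytes in flight needed to keep the link full."""
    return link_bits_per_s * rtt_s / 8

LINK_BPS = 1e9                     # assumption: gigabit Ethernet
RTT_S = 0.3e-3                     # 0.3 ms, from the 4-byte test in this thread
BUFFER_BYTES = 4 * 1024 * 1024     # 4MB TCP send/recv buffer limit
IMAGE_BYTES = 512 * 512 * 4        # assumption: 4 bytes per pixel

bdp = bdp_bytes(LINK_BPS, RTT_S)
print(f"bandwidth-delay product: {bdp:.0f} bytes")
print(f"full image:              {IMAGE_BYTES} bytes")
print(f"buffer limit:            {BUFFER_BYTES} bytes")
```

Under these assumptions the bandwidth-delay product is far below both the image size and the buffer limit, which is consistent with the point that a LAN should not need buffer tuning.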
> > I loop with the client side program sending a single integer to rank 0,
> > rank 0 broadcasts this integer to the other nodes, and then all nodes send
> > back 1MB / N of data.
> hmm, that's a bit harsh, don't you think? why not have the rank0/master
> ask each slave for its contribution sequentially? sure, it introduces a bit
> of "dead air", but it's not as if two slaves can stream to a single master
> at once anyway (each can saturate its link, therefore the master's link is
> N-times overcommitted.)
I guess I figured that the data is relatively small compared to the
bandwidth, whereas the latency for ethernet is relatively high. I also
thought the switch would be able to efficiently buffer and forward the data.
I am not much of a networking guy (more a graphics guy) so I realize I could
be way off.
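To put a number on what "buffer and forward" would demand of the switch, here is an idealized sketch. It assumes every node sends its share at full line rate at the same instant, that the receiver's port drains at that same line rate, and that nothing (flow control, TCP pacing) slows the senders down. The function name is made up for illustration.

```python
def peak_switch_queue_bytes(total_bytes, n_senders):
    """Peak backlog on the receiver's switch port when n_senders each send
    total_bytes / n_senders at line rate at the same time and the port
    drains at that same line rate (idealized model, no flow control)."""
    per_sender = total_bytes / n_senders
    # While the burst lasts, data arrives n times faster than it drains,
    # so (n - 1) of every n arriving bytes must wait in the queue.
    return per_sender * (n_senders - 1)

TOTAL_BYTES = 1024 * 1024  # the 1MB gathered per frame, taken as 1 MiB

for n in (2, 5, 8):
    peak = peak_switch_queue_bytes(TOTAL_BYTES, n)
    print(f"{n} senders: peak queue ~{peak / 1024:.0f} KiB")
```

In this model the switch has to hold most of the frame on a single port for a moment. Whether it can depends on how much buffer memory it has per port, which is not known here.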
> > To make sure there was not an issue with the MPI broadcast, I did one test
> > run with 5 nodes only sending back 4 bytes of data each. The result was an
> > RTT of less than 0.3 ms.
> isn't that kind of high? a single ping-pong latency should be ~50 us -
> maybe I'm underestimating the latency of the broadcast itself.
This is quite a bit more than a single ping-pong. The viewer sends to the
master node (rank 0), and then the master node broadcasts to all other
nodes, and then all nodes send back to the viewer node. I don't know if
this still seems high.
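A back-of-the-envelope count of the message steps in one frame, as a sketch only. It assumes the MPI library uses a binomial-tree broadcast (not confirmed for this setup) and treats the ~50 us figure as the cost of one message step; if that figure is a full round trip, halve the result.

```python
def sequential_message_steps(n_nodes):
    """Count message steps that must happen one after another in one frame."""
    viewer_to_master = 1
    # Assumption: binomial-tree broadcast, which needs ceil(log2(n)) rounds.
    bcast_rounds = (n_nodes - 1).bit_length()
    nodes_to_viewer = 1  # replies travel in parallel, so count them as one step
    return viewer_to_master + bcast_rounds + nodes_to_viewer

STEP_LATENCY_US = 50  # assumption: cost of one step, from the ~50 us figure

steps = sequential_message_steps(5)
print(f"{steps} sequential steps, roughly {steps * STEP_LATENCY_US} us")
```

Under these assumptions the 5-node test involves several sequential steps rather than one, so a total somewhat below 0.3 ms is not obviously out of line.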
> > One interesting pattern I noticed is that the hiccup frame RTTs, almost
> > without exception, fall into one of three ranges (approximately 50-60,
> > 200-210, and 250-260). Could this be related to exponential back-off?
> perhaps introduced by the switch, or perhaps by the fact that the bcast
> isn't implemented as an atomic (eth-level) broadcast.
But the bcast is always just sending 4 bytes (a single integer), and as
mentioned above no hiccups occur until the size of the final gather packets
(from all nodes to the viewer) is increased.
> > Tomorrow I will experiment with jumbo frames and flow control settings (both
> > of which the HP Procurve claims to support). If these do not solve the
> > problems I will start sifting through tcpdump.
> I would simply serialize the slaves' responses first. the current design
> tries to trigger all the slaves to send results at once, which is simply
> not logical if you think about it, since any one slave can saturate
> the master's link.
I still have the feeling that the switch should be able to handle this more
efficiently, but since your idea is relatively simple to implement I will
give it a try and see what the performance is like.
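A rough timing model of the two designs, to see how much "dead air" serializing would cost in the best case. This is a sketch under assumptions: a 1 Gb/s link, a ~50 us request round trip per node, and no packet loss in either design. Both function names are made up for illustration.

```python
def simultaneous_time(total_bytes, link_bps, rtt_s):
    """Ideal case: one trigger, then all replies share the receiver's link."""
    return rtt_s + total_bytes * 8 / link_bps

def serialized_time(total_bytes, n_nodes, link_bps, rtt_s):
    """Ask each node in turn; each request pays its own round trip."""
    per_node_bytes = total_bytes / n_nodes
    return n_nodes * (rtt_s + per_node_bytes * 8 / link_bps)

TOTAL_BYTES = 1024 * 1024   # 1MB per frame, taken as 1 MiB
LINK_BPS = 1e9              # assumption: gigabit Ethernet
RTT_S = 50e-6               # assumption: ~50 us per request round trip
N_NODES = 5

sim = simultaneous_time(TOTAL_BYTES, LINK_BPS, RTT_S)
ser = serialized_time(TOTAL_BYTES, N_NODES, LINK_BPS, RTT_S)
print(f"simultaneous (ideal): {sim * 1e3:.2f} ms")
print(f"serialized:           {ser * 1e3:.2f} ms")
print(f"extra dead air:       {(ser - sim) * 1e6:.0f} us")
```

In this model the only cost of serializing is one extra round trip per additional node, because the receiver's link is the bottleneck either way.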
Thanks for your input.
> regards, mark hahn.