[Beowulf] Help with inconsistent network performance

Brendan Moloney moloney.brendan at gmail.com
Tue Dec 18 21:40:48 PST 2007


On 12/18/07, Mark Hahn <hahn at mcmaster.ca> wrote:
>
> > The machines are running the 2.6 kernel and I have confirmed that the max
> > TCP send/recv buffer sizes are 4MB (more than enough to store the full
> > 512x512 image).
>
> the bandwidth-delay product in a lan is low enough to not need
> this kind of tuning.


I didn't actually do any tuning; I just checked that the max buffer size the
Linux auto-tuning can use is sufficient.
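
For a rough sense of scale, the bandwidth-delay product can be worked out directly. This is only a sketch: the link speed and round-trip time below are assumed values (gigabit Ethernet, 0.1 ms), not figures from this thread; only the 4MB maximum comes from the message above.

```python
def bandwidth_delay_product_bytes(link_bits_per_s: float, rtt_s: float) -> float:
    """Bytes that can be in flight on the link during one round trip."""
    return link_bits_per_s * rtt_s / 8


# Assumed values, not taken from the thread: gigabit Ethernet, 0.1 ms LAN RTT.
LINK_BITS_PER_S = 1e9
RTT_S = 100e-6

bdp = bandwidth_delay_product_bytes(LINK_BITS_PER_S, RTT_S)
max_buffer = 4 * 1024 * 1024  # the 4MB maximum reported in the thread

print(f"BDP: {bdp:.0f} bytes")
print(f"max buffer / BDP: {max_buffer / bdp:.0f}x")
```

Under those assumptions the product is on the order of kilobytes, far below the 4MB ceiling, which is consistent with the point that buffer tuning should not be needed on a LAN.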

> > I loop with the client side program sending a single integer to rank 0, then
> > rank 0 broadcasts this integer to the other nodes, and then all nodes send
> > back 1MB / N of data.
>
> hmm, that's a bit harsh, don't you think?  why not have the rank0/master
> ask each slave for its contribution sequentially?  sure, it introduces a bit
> of "dead air", but it's not as if two slaves can stream to a single master
> at once anyway (each can saturate its link, therefore the master's link is
> N-times overcommitted.)


I guess I figured that the data is relatively small compared to the
bandwidth, whereas the latency for Ethernet is relatively high.  I also
thought the switch would be able to efficiently buffer and forward the data.
I am not much of a networking guy (more a graphics guy), so I realize I could
be way off base here.
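
How much the switch would have to buffer can be estimated with a simple model. This is a sketch under idealized assumptions that are not stated in the thread: every sender starts at the same instant, all ports run at the same line rate, and nothing (flow control, TCP window growth) slows the senders down. The function name and the node counts are illustrative only.

```python
def switch_backlog_bytes(total_bytes: float, n_senders: int) -> float:
    """Peak backlog at the receiver's switch port when n_senders each send
    total_bytes / n_senders at line rate at the same moment.

    Data arrives at n_senders x line rate but drains at 1 x line rate, so by
    the time the senders finish, only 1/n_senders of the data has left.
    """
    return total_bytes * (n_senders - 1) / n_senders


TOTAL_BYTES = 1024 * 1024  # the 1MB gathered per frame in the thread

for n in (2, 5, 8):  # node counts chosen for illustration
    backlog_kib = switch_backlog_bytes(TOTAL_BYTES, n) / 1024
    print(f"{n} senders: about {backlog_kib:.0f} KiB queued at the switch")
```

In this model most of the 1MB has to sit in the switch at once. Whether a given switch can hold that much per port is something to check against its specifications. If the queue overflows and segments are dropped, TCP retransmission timers come into play; the Linux minimum retransmission timeout is 200 ms.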


> > To make sure there was not an issue with the MPI broadcast, I did one test
> > run with 5 nodes only sending back 4 bytes of data each.  The result was an
> > RTT of less than 0.3 ms.
>
> isn't that kind of high?  a single ping-pong latency should be ~50 us -
> maybe I'm underestimating the latency of the broadcast itself.


This is quite a bit more than a single ping-pong. The viewer sends to the
master node (rank 0), then the master node broadcasts to all other
nodes, and then all nodes send back to the viewer node.  I don't know if
this still seems high.
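
A back-of-the-envelope count of the sequential messages in one frame gives a feel for the expected figure. This is a sketch with two assumptions that are not confirmed anywhere in the thread: that each small message costs roughly the ~50 us quoted above, and that the broadcast is implemented as a binomial tree (real MPI libraries may choose a different algorithm). The function name is hypothetical.

```python
import math


def frame_latency_us(n_nodes: int, per_message_us: float) -> float:
    """Rough latency of one tiny-payload frame.

    Assumes the broadcast is a binomial tree, which takes
    ceil(log2(n_nodes)) sequential message steps.
    """
    bcast_steps = math.ceil(math.log2(n_nodes)) if n_nodes > 1 else 0
    # viewer -> rank 0, then the broadcast steps, then nodes -> viewer
    sequential_messages = 1 + bcast_steps + 1
    return sequential_messages * per_message_us


print(frame_latency_us(n_nodes=5, per_message_us=50.0))
```

With 5 nodes and 50 us per message this comes to about 250 us, which is in the same range as the measured "less than 0.3 ms", so under these assumptions the number does not look unusually high.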


> > One interesting pattern I noticed is that the hiccup frame RTTs, almost
> > without exception, fall into one of three ranges (approximately 50-60,
> > 200-210, and 250-260). Could this be related to exponential back-off?
>
> perhaps introduced by the switch, or perhaps by the fact that the bcast
> isn't implemented as an atomic (eth-level) broadcast.
>

But the bcast is always just sending 4 bytes (a single integer), and as
mentioned above no hiccups occur until the size of the final gather packets
(from all nodes to the viewer) is increased.


>
> > Tomorrow I will experiment with jumbo frames and flow control settings (both
> > of which the HP Procurve claims to support).  If these do not solve the
> > problems I will start sifting through tcpdump.
>
> I would simply serialize the slaves' responses first.  the current design
> tries to trigger all the slaves to send results at once, which is simply
> not logical if you think about it, since any one slave can saturate
> the master's link.
>

I still have the feeling that the switch should be able to handle this more
efficiently, but since your idea is relatively simple to implement I will
give it a try and see what the performance is like.
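
The cost of the "dead air" from serializing can be estimated before implementing it. This is only a timing model, not MPI code, and its inputs are assumptions rather than measurements from this cluster: a gigabit link, 50 us per request, 5 nodes, and an ideal switch in the simultaneous case. The function names are hypothetical.

```python
def simultaneous_gather_s(total_bytes, link_bytes_per_s, latency_s):
    """Best case when all nodes send at once: one request latency, then the
    receiver's link is busy for the whole payload."""
    return latency_s + total_bytes / link_bytes_per_s


def serialized_gather_s(total_bytes, n_nodes, link_bytes_per_s, latency_s):
    """Receiver asks each node in turn: every request costs one latency of
    idle link time before that node's share starts arriving."""
    share = total_bytes / n_nodes
    return n_nodes * (latency_s + share / link_bytes_per_s)


# Assumed values, not measurements from the thread.
TOTAL_BYTES = 1024 * 1024   # 1MB per frame, as in the thread
LINK_BYTES_PER_S = 125e6    # gigabit Ethernet
LATENCY_S = 50e-6           # per-request latency
N_NODES = 5

sim = simultaneous_gather_s(TOTAL_BYTES, LINK_BYTES_PER_S, LATENCY_S)
ser = serialized_gather_s(TOTAL_BYTES, N_NODES, LINK_BYTES_PER_S, LATENCY_S)
print(f"simultaneous (ideal): {sim * 1e3:.2f} ms")
print(f"serialized:           {ser * 1e3:.2f} ms")
print(f"dead air:             {(ser - sim) * 1e6:.0f} us")
```

In this model the serialized version loses (N - 1) request latencies compared with the ideal simultaneous case, which is small next to the transfer time of the payload itself.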

Thanks for your input.



>
> regards, mark hahn.
>