[Beowulf] Help with inconsistent network performance
landman at scalableinformatics.com
Tue Dec 18 09:32:36 PST 2007
Brendan Moloney wrote:
> I have a cluster of 8 Linux machines connected with gigabit
> ethernet (full duplex) to a HP Procurve 2848 switch. I am using the
> machines to do interactive distributed rendering. I have noticed that the
> final gather stage (where the intermediate images from the render nodes are
> sent back to the viewing node) has "hiccups" in the performance. These
How are they sent? NFS? Sockets? ...
> hiccups occur with as few as two render nodes, and become more common as I
> add more render nodes. With a 512x512 image the final gather usually takes
> a few milliseconds for each frame, but when the hiccups occur it is more
> like 200+ milliseconds.
Is this "real time" rendering so that frame rate is the most important
> Since it is a full duplex switched network, there should not be any
> collisions happening. Since the image is less than 1 MB total, I don't
There could be blocking ... if one unit grabs the single network pipe
of the display node while another node tries to send data, then the
late node will back off (well, with TCP it will) in a pre-determined manner.
> think I am saturating the switch. I have checked the contents of
> /sbin/ifconfig and there are zero erroneous packets being reported. At this
You wouldn't see it there. It would be on the switch, and even then the
switch wouldn't term it a collision. It is a switch behaving normally.
> point I am really at a loss as to what is causing this. Any input on things
> to check would be greatly appreciated.
I assume you have a single gigabit link from the display node to the switch.
As you scale up the number of render nodes, you notice more of these
"hiccups" scaling about linearly with the number of nodes.
This suggests resource contention. Each 512x512 image would be
fragmented into about 175 packets of 1500 bytes. This assumes 8 bit
images. If you are using 8 bits per color, 3 colors and an alpha
channel, then this is ~700 packets. Each 1500 byte packet takes about
11us to transmit, and has a non-trivial latency associated with it. I
will estimate the latency at 30us (switch latency of ~5us + network
stack latency of about 12.5us on each side). So for each packet, you
have about 41us to transfer it. If you have 8 bit images, then this
corresponds to 7.2 ms per image. There may be some other caching
effects that I am missing, or I may have mis-computed. For 32 bits (3x
8 bit color channels + 1 alpha channel), this is looking like 28.8 ms
for each image. Best case you could do with this is about 34.7 frames
per second.
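The estimate above can be reproduced with a few lines of Python. This is only a sketch: the function name frame_time_ms and its parameters are made up for illustration, the 11us and 30us defaults are the rough figures from this message rather than measurements, and it treats the full 1500 bytes as payload (no header overhead), as the text does.

```python
import math

def frame_time_ms(width, height, bytes_per_pixel, mtu_bytes=1500,
                  wire_us=11.0, latency_us=30.0):
    """Return (packets, milliseconds) to send one frame, one packet at a time.

    wire_us and latency_us are the per-packet estimates quoted in the
    message; they are assumptions, not measured values.
    """
    image_bytes = width * height * bytes_per_pixel
    packets = math.ceil(image_bytes / mtu_bytes)
    per_packet_us = wire_us + latency_us
    return packets, packets * per_packet_us / 1000.0

# 1 byte per pixel (8 bit image) and 4 bytes per pixel (RGBA)
for bpp in (1, 4):
    packets, ms = frame_time_ms(512, 512, bpp)
    print(f"{bpp} byte(s)/pixel: {packets} packets, {ms:.1f} ms per frame")
```

For a 512x512 image this gives 175 packets and about 7.2 ms at 1 byte per pixel, and 700 packets and about 28.7 ms at 4 bytes per pixel, in line with the rounded numbers above.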
If on the other hand, you used jumbo frames with 9000 byte packets, you
would need 30 packets to transfer each image. Each would require 67.1us
to move, and still 30us of latency, for 97.1us per packet. For 30
packets, this is 2.9ms. For the 32 bit version as indicated previously
(3x 8 bit color channels, and one alpha channel) this would be about
11.6ms, or 85.9 frames per second.
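As a cross-check on the comparison, here is a small sketch that derives the per-packet wire time from the raw 1 Gbit/s line rate instead of using the quoted values. The names (estimate, LINK_BITS_PER_SEC, LATENCY_US) are hypothetical, the 30us latency is the estimate from this message, and headers and framing overhead are ignored. Raw line rate works out to 12us and 72us per packet rather than the 11us and 67.1us used above, so the absolute numbers shift a little, but the jumbo frame case still comes out better than twice as fast.

```python
import math

LINK_BITS_PER_SEC = 1_000_000_000  # gigabit Ethernet, as in the question
LATENCY_US = 30.0                  # per-packet latency estimate (assumed)

def estimate(image_bytes, mtu_bytes):
    """Return (packets, wire_us_per_packet, total_ms) for one image."""
    packets = math.ceil(image_bytes / mtu_bytes)
    # Time to clock one full-size packet onto the wire at line rate.
    wire_us = mtu_bytes * 8 * 1e6 / LINK_BITS_PER_SEC
    total_ms = packets * (wire_us + LATENCY_US) / 1000.0
    return packets, wire_us, total_ms

image_bytes = 512 * 512 * 4  # 32 bit RGBA frame
for mtu in (1500, 9000):
    packets, wire_us, total_ms = estimate(image_bytes, mtu)
    print(f"mtu {mtu}: {packets} packets, {wire_us:.0f} us on the wire each, "
          f"{total_ms:.1f} ms per frame, {1000.0 / total_ms:.1f} frames/s")
```

Note that this counts 117 jumbo packets for the 32 bit image, while the text multiplies the 30-packet figure by four; the difference is small next to the uncertainty in the latency estimate.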
Based on this, I would suggest seeing if changing mtu to 9000 helps.
    ifconfig eth0 mtu 9000
on all your nodes (every one).
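Since the change has to be in effect on every node, it is worth confirming it took. A minimal sketch, assuming a Linux node where the kernel exposes the interface MTU under /sys/class/net/<interface>/mtu; the helper name read_mtu and the interface name eth0 are placeholders, and this only checks the host side (the switch ports also have to allow jumbo frames, which it cannot see).

```python
from pathlib import Path

def read_mtu(interface="eth0", sysfs_root="/sys/class/net"):
    """Return the MTU Linux reports for an interface, or None if unreadable."""
    mtu_file = Path(sysfs_root) / interface / "mtu"
    try:
        return int(mtu_file.read_text().strip())
    except (OSError, ValueError):
        # Interface missing, not Linux, or unexpected file contents.
        return None

if __name__ == "__main__":
    mtu = read_mtu("eth0")
    if mtu is None:
        print("eth0: could not read MTU")
    elif mtu < 9000:
        print(f"eth0: MTU is {mtu}, jumbo frames not in effect")
    else:
        print(f"eth0: MTU is {mtu}")
```

Run it on each render node and on the display node; any node still reporting 1500 has not picked up the change.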
The argument for this is that you pay the per-packet latency fewer
times, even though each packet takes longer to transfer.
Another possibility is channel bonding on your display node.
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 866 888 3112
cell : +1 734 612 4615