[Beowulf] Help with inconsistent network performance
patrick at myri.com
Wed Dec 19 00:26:46 PST 2007
Greg Lindahl wrote:
> On Tue, Dec 18, 2007 at 09:05:41PM -0500, Patrick Geoffray wrote:
>> No, it just means the NIC supports it.
> Well, then how about ethtool -S? That looks like an actual count of
> flow control events, so rx flow control events means the switch
> must support it in some fashion.
If this counter is nonzero, then you can say the switch does support RX
flow control, which is the most important part. However, the NIC driver
may not report these events to ethtool, so you may have to generate
some contention in the switch yourself. A simple test is to run a small
MPI code where several senders stream to a single receiver. If the
aggregate bandwidth equals the receiver's link bandwidth, then flow
control works. If, on top of that, all senders get the same bandwidth,
then the switch is also fair.
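
As a rough sketch of how to read the numbers from such an N->1 test, the
helper below applies the two checks described above to a list of measured
per-sender bandwidths. It is not the MPI test itself. The function name
and the 10% tolerance are assumptions made for illustration; pick a
tolerance that matches your protocol overhead and measurement noise.

```python
def assess_incast(sender_mbps, link_mbps, tolerance=0.10):
    """Interpret an N->1 streaming test.

    sender_mbps: measured throughput of each sender, in Mb/s.
    link_mbps:   nominal bandwidth of the receiver's link, in Mb/s.
    tolerance:   assumed relative slack (10%) for overhead and noise.
    """
    if not sender_mbps:
        raise ValueError("need at least one sender measurement")
    total = sum(sender_mbps)
    # Flow control works if the senders together fill the receiver's link.
    flow_control_ok = total >= (1.0 - tolerance) * link_mbps
    # The switch is fair if every sender gets close to an equal share.
    mean = total / len(sender_mbps)
    fair = all(abs(b - mean) <= tolerance * mean for b in sender_mbps)
    return {
        "aggregate_mbps": total,
        "flow_control_ok": flow_control_ok,
        "fair": fair,
    }
```

For example, four senders measured at 235, 240, 238 and 236 Mb/s into a
1000 Mb/s receiver link would pass both checks, while an aggregate well
below the link rate would suggest that packets are being dropped rather
than paused.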
> Well, we know it can be done perfectly, it's done in InfiniBand
> switches, and that other 10 gig non-ethernet switch, what's it called?
> Oh yeah, Myrinet. They do it, too.
In Ethernet, the sender has to finish sending the current packet before
stopping, so your switch buffers should be able to store a full frame
in addition to the wire delay. In Myrinet (and I presume in IB), the
hardware flow control can stop a sender in the middle of a packet, so
you only have to buffer the wire delay. That is 4 KB per port for
Myrinet versus 12 to 16 KB per port for Ethernet. The difference is not
trivial, and some corners may be cut to save space/money in the switch
chips.
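
The buffer arithmetic can be sketched as follows. The 4 KB wire delay is
the figure given above; the 9000-byte jumbo frame is an assumption chosen
for illustration (the frame size is not stated here), and the function
name is hypothetical.

```python
def port_buffer_bytes(wire_delay_bytes, max_frame_bytes, can_stop_mid_packet):
    """Bytes a switch port must absorb after asserting flow control."""
    if can_stop_mid_packet:
        # Myrinet-style: only the data already in flight on the wire arrives.
        return wire_delay_bytes
    # Ethernet-style: the sender first finishes the frame it is transmitting.
    return wire_delay_bytes + max_frame_bytes


WIRE_DELAY = 4 * 1024  # 4 KB, the per-port figure quoted in the text
JUMBO_FRAME = 9000     # assumed maximum frame size, not from the text

print(port_buffer_bytes(WIRE_DELAY, JUMBO_FRAME, can_stop_mid_packet=True))   # 4096
print(port_buffer_bytes(WIRE_DELAY, JUMBO_FRAME, can_stop_mid_packet=False))  # 13096
```

Under that assumption the Ethernet case comes to roughly 13 KB per port,
which falls inside the 12 to 16 KB range mentioned above.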
>> Flow-control is not for everyone, and that's why it is often turned off
>> by default. When a sender is paused, it will stop sending anything,
>> including packets for different destinations. Dropping packets is
>> expensive to recover but it keeps things moving.
> Can Myrinet even disable flow control? Odd that Ethernet is any
> different; dropping any packets is an utter disaster for TCP.
I think it's technically possible to disable flow control in the switch
crossbars in Myrinet, but you would not want to. The NICs can change
routes quickly when they sense contention on a specific path (Quadrics
does the same thing, others can't). That helps a lot for internal hot
spots that are frequent in HPC, but it does nothing against the N->1
communication pattern of death. As Mark pointed out, the best way around
it is to not have it in the first place.
Ethernet switches are often used in more hostile environments where you
cannot prevent such N->1 traffic. I could flood a particular machine on
a campus from a couple of hosts to produce contention. That would
saturate some internal links in the switch, which would propagate the
contention to other ports, so more links are blocked, and so on. If you
can sustain the contention for a few seconds on a busy switch, you can
block the whole thing: complete meltdown.
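
A toy model of how that contention spreads under flow control: when a
link blocks, everything feeding it gets paused, and the pause then moves
one hop further upstream. The topology and all names below are
hypothetical, and the model ignores buffering and timing, so treat it as
a worst-case sketch rather than a description of any real switch.

```python
from collections import deque


def blocked_links(feeds, hot_link):
    """Return every link that ends up blocked when hot_link saturates.

    feeds maps a link to the upstream links that send traffic into it.
    With flow control, a blocked link pauses everything feeding it.
    """
    blocked = {hot_link}
    queue = deque([hot_link])
    while queue:
        link = queue.popleft()
        for upstream in feeds.get(link, []):
            if upstream not in blocked:
                blocked.add(upstream)
                queue.append(upstream)
    return blocked


# Hypothetical two-level switch: hosts -> line cards -> port to the victim.
feeds = {
    "to_victim": ["card_a", "card_b"],
    "card_a": ["host1", "host2"],
    "card_b": ["host3"],
    "card_c": ["host4"],
}
print(sorted(blocked_links(feeds, "to_victim")))
# ['card_a', 'card_b', 'host1', 'host2', 'host3', 'to_victim']
```

Note that a paused host link stalls all of that host's traffic, including
packets for other destinations, which is how bystanders get hurt.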
That's why high-end switches/routers are super expensive: they are
heavily over-provisioned inside to be able to handle contention. That's
also why the FCoE folks are pushing for per-priority flow control in
Ethernet, so that untrusted/misbehaving traffic can be dropped without
affecting trusted/important FCoE traffic that should not be dropped. And
that's why switch flow control is turned off by default most of the time.