[Beowulf] [jak at uiuc.edu: Re: [APPL:Xgrid] [Xgrid] Re: megaFlops per Dollar? real world requirements]
Robert G. Brown
rgb at phy.duke.edu
Sat May 14 04:32:48 PDT 2005
On Sun, 15 May 2005, Eugen Leitl wrote:
> ----- Forwarded message from "Jay A. Kreibich" <jak at uiuc.edu> -----
> On Thu, May 12, 2005 at 01:45:45PM -0500, Jay A. Kreibich scratched on the wall:
(A loverly discussion of IPoFW)
Hey, Jay, over here in beowulf-land WE appreciate your analysis. I
found it very useful and entirely believable. In fact, I've seen very
similar behavior to that which you describe in some low end (RTL 8139)
100 Mbps ethernet adapters -- the bit about network throughput going
absolutely to hell in a handbasket if the network becomes congested
enough to saturate the cards. In the case of the RTL, I recall it was a
buffering issue (or rather, the lack thereof) so overruns were dropped
and a full-out TCP stream could drop to 2 Mbps on a particularly bad day.
We also understand the difference between "latency" (which usually
dominates small packet/message transfer) and "bandwidth" (which is
usually wirespeed less mandatory overhead for some optimal pattern of
data transfer, e.g. large messages in TCP/IP). In fact, your whole
article was very, very cogent.
> > IPoFW performance is very very low. Expect 100Mb Ethernet (yes,
> > that's "one hundred") to provide better performance than 400Mb FW.
> > There was a big discussion about this many months ago that led to
> > Apple removing any references to IPoFW from their Xserve and cluster
> > web pages. The utilization difference is that big.
> It appears that there are members on this list who disagree
> with me and would rather cuss at me in private than have an
> intelligent, rational discussion with the whole group. Since they
> choose harsh language over running a few simple bandwidth tests,
> I ran those tests myself (numbers below), and will direct a few comments
> at the group as a whole. Maybe others can contribute some meaningful
> comments. If you disagree with me, at least do it in public.
> > While the raw bandwidth numbers for FireWire are higher, the FireWire
> > MAC layer is designed around block transfers from a disk, tape, or
> > similar device.
> First off, let's be sure we're all on the same page. The original
> question was about the use of Xgrid over FireWire based networks.
> Since Xgrid runs on top of BEEP over TCP/IP, the question really
> boils down to one of performance of IP over FireWire-- e.g., IPoFW.
> It is important to understand that this is not an encapsulation of
> an Ethernet stream on the FireWire link, or some other more traditional
> networking technology, but actually running FireWire as the Layer-2
> transport for IP. RFC-2734 explains how this is done.
> The problem with IPoFW is that FireWire is designed as an
> infrastructure interconnect, not a networking system. It has a lot
> more in common with systems like SCSI, HiPPI, and Fibre Channel
> than it does with systems like Ethernet.
> Since every major networking technology of the last 30 years has been
> frame/packet or cell based (and even cell is getting more and more
> rare), it shouldn't be a big shock that most traditional networking
> protocols (e.g. IP) are designed and tuned with these types of physical
> transport layers in mind. While FireWire is much better at large
> bulk transfers, it is not so hot at moving lots of very small data
> segments around, such as individual IP packets.
> In many ways, it is like the difference between a fleet of large
> trucks and a train of piggy-back flat cars. Both are capable of
> transporting the same basic unit of data, but each is designed around
> a different set of requirements. Each has its strength and weakness,
> depending on what you are trying to do. If you're trying to move
> data en masse from a disk (or video camera) to a host system, the
> train model will serve you much better. The connection setup is
> expensive, but the per-unit costs are low assuming a great number of
> units. If, on the other hand, you're trying to download data from
> the web, the truck model is a better deal. The per-unit costs are a
> bit higher, but the system remains fairly efficient with lower
> numbers of units since the connection setup is much less.
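The trade-off in this analogy can be put in numbers. Below is a toy cost model, with all figures invented purely for illustration: a "train" (FireWire-style) link with expensive setup but cheap per-unit cost, versus a "truck" (Ethernet-style) link with cheap setup but a higher per-unit cost.

```python
# Hypothetical cost model illustrating the truck/train analogy:
# total transfer cost = connection setup + units * per-unit cost.
# All numbers are made up for illustration only.

def total_cost(setup, per_unit, units):
    """Total cost of moving `units` data units over a link."""
    return setup + per_unit * units

# "Train" (FireWire-style): expensive setup, cheap per unit.
# "Truck" (Ethernet-style): cheap setup, pricier per unit.
TRAIN = (100.0, 1.0)   # (setup, per-unit)
TRUCK = (5.0, 2.0)

def cheaper(units):
    """Which model wins for a transfer of `units` units?"""
    train = total_cost(*TRAIN, units)
    truck = total_cost(*TRUCK, units)
    return "train" if train < truck else "truck"

# Small transfers favor the truck; bulk transfers favor the train.
print(cheaper(10))    # truck
print(cheaper(1000))  # train
```

With these made-up numbers the crossover is at 95 units; the point is only that each model wins in its own regime, exactly as the paragraph above argues.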
> So if you hook two machines together with a FireWire cable, put
> one of those machines into "target disk" mode, and start to copy
> files back and forth, I would expect you get really good performance.
> In fact, despite the fact that GigE has over twice the bandwidth of
> FireWire 400 (GigE = 1000Mbps, FW400 = 400Mbps), I would expect the
> FireWire to outperform any network based file protocol, like NFS or
> AFP, running over GigE, in operations such as a copy. This is exactly
> the type of operation that FireWire is designed to do, so it is no
> shock that it does it extremely efficiently. When used in something
> like target disk mode, it is also operating at a very low level in
> the kernel (on the host side), with a great deal of hardware
> assistance. NFS or AFP, on the other hand, are layered on top of
> the entire networking stack (on both the "disk" side and the "host"
> side) and have to deal with a great number of abstractions. Also,
> because of the hardware design (largely having to do with the size
> of the packets/frames) it is difficult for most hardware to fully
> utilize a GigE connection, so the full 1000Mb can't be used (at all;
> this limit isn't specific to file protocols). So it isn't a big
> shock that a network file protocol doesn't work very efficiently
> and that the slower transport can do a better job-- it is designed
> to do a better job, and you aren't using the technologies in the
> same way. A more valid comparison might be between FW and iSCSI
> over Ethernet so that the two transport technologies are at least
> working at the same level (and even then, I would still expect FW
> to win, although not by as much).
> This is, however, a two-way street. If we return to the question of
> IPoFW, where you are moving IP packets rather than disk blocks, it
> should be no shock that a transport technology specifically designed
> to move network packets can outperform one that was designed around
> block copies. Ethernet is a very light-weight protocol (which is
> both good and bad, like the trucks) and deals with frame based
> network data extremely well. Even if we assume that FireWire can run
> with a high efficiency, it would be normal to expect GigE to
> outperform it, just because it has 2.5x the bandwidth. But because
> you're asking FireWire to do something it isn't all that great at,
> the numbers are much worse.
> So here's what I did. I hooked my TiBook to my dual 1.25
> QuickSilver. On each I created a new Network Location with just the
> FireWire physical interface, and assigned each one an address on the
> 10 net. There were no other active interfaces. I then ran a
> series of 60 second tests using "iperf" from NLANR, forcing a
> bi-directional data stream over the IPoFW-400 link. I used the TCP
> tests, because this is the only way to have the system do direct
> bandwidth measurements. This adds overhead to the transaction and
> reduces the results (which are indicated as payload bytes only), but
> since I ran the test the same way in all cases, that shouldn't make a
> huge difference.
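For anyone wanting to reproduce this, the tests described were presumably invocations along these lines, assuming classic NLANR iperf (iperf2) and the 10-net addresses mentioned above; the exact flags are my guess, not taken from the post:

```shell
# Receiver (hypothetical address 10.0.0.1 on the FireWire interface):
iperf -s

# Sender: 60-second TCP test; -d runs traffic in both directions
# at once, matching the bi-directional test described above.
iperf -c 10.0.0.1 -t 60 -d

# Uni-directional version: drop the -d flag.
iperf -c 10.0.0.1 -t 60

# UDP variant used later in the post: push a fixed 200 Mbps stream
# and let the server-side report show how much actually arrived.
iperf -s -u                          # receiver
iperf -c 10.0.0.1 -u -b 200M -t 60  # sender
```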
> Anyways, with the bi-directional test, I was able to get roughly
> 90Mbps (yes, "ninety megabits per second") upstream, and 30Mbps
> downstream using the IPoFW-400 link. It seems there were a lot of
> contention issues when data was pushed both ways at the same time,
> and one side seemed to always gain the upper hand. That's not a
> very good thing for a network to do, and points to self-generated
> congestion issues.
> If I only pushed data in one direction, I could get it up to about
> 125Mbps. I'll grant you that's better than 100baseTX, but I'm not
> sure I consider half-duplex speeds all that interesting. As was
> clear from the other test, when you add data going the other way,
> performance drops considerably.
> Just to be sure I was doing the test correctly, I ran the same tests
> with a point-to-point Ethernet cable between the machines. Both
> machines have GigE, so it ran nicely around 230Mbps in both
> directions. That may sound a bit low, but the TiBook is an older
> machine and the processor isn't super fast. In fact, running 460Mbps
> of data through the TCP stack isn't too bad for an 800MHz machine
> (that's one payload byte per 14 CPU cycles, which is pretty darn good!)
> that isn't running jumbo frames.
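That cycles-per-byte figure checks out. A quick back-of-the-envelope calculation with the same numbers:

```python
# Sanity check on the "one payload byte per 14 CPU cycles" figure:
# 230 Mbps each way = 460 Mbps aggregate through an 800 MHz CPU.

cpu_hz = 800e6            # TiBook clock rate
aggregate_bps = 460e6     # 230 Mbps in each direction
payload_bytes_per_sec = aggregate_bps / 8

cycles_per_byte = cpu_hz / payload_bytes_per_sec
print(round(cycles_per_byte))  # 14
```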
> Speed aside, it is also important to point out that the up-stream and
> down-stream numbers were EXACTLY the same. The network seemed to
> have no contention issues, and both sides were able to run at the
> maximum speed the end-hosts could sustain.
> Just for kicks, I manually set both sides to 100Mb/full-duplex and
> ran the test. The numbers worked out to about 92Mbps, both ways.
> A bit lower than you might expect, but given the known overhead of
> TCP it isn't too bad. Again, both sides were able to sustain the
> same rates. It is also worth noting that the CPU loads on the systems
> seemed to be considerably less for this test than the FireWire test,
> even though the amount of data being moved was slightly higher.
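The "known overhead of TCP" here is mostly fixed per-frame framing cost, and the 92 Mbps figure is consistent with it. A minimal sketch of the theoretical ceiling, assuming a standard 1500-byte MTU and no TCP options:

```python
# Theoretical TCP goodput ceiling on 100 Mbps Ethernet with a
# standard 1500-byte MTU (jumbo frames would raise the ceiling).

MTU = 1500
IP_HDR = 20
TCP_HDR = 20                     # base header, no options assumed
ETH_OVERHEAD = 14 + 4 + 8 + 12   # header + FCS + preamble + inter-frame gap

payload = MTU - IP_HDR - TCP_HDR   # 1460 payload bytes per frame
on_wire = MTU + ETH_OVERHEAD       # 1538 bytes actually on the wire

efficiency = payload / on_wire
print(round(100e6 * efficiency / 1e6, 1))  # 94.9 (Mbps ceiling)
```

The observed ~92 Mbps sits a little below this ceiling, which is plausible once reverse-path ACK traffic and TCP options are accounted for.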
> I also ran a few UDP tests. In this case, you force iperf to
> transmit at a specific rate. If the system or network is unable to
> keep up, packets are simply dropped. In a uni-directional test the
> IPoFW-400 link could absorb 130 Mbps well enough, and was able to
> provide that kind of data rate. When pushed to 200Mbps, the actual
> transmitted data dropped to an astounding *20*Mbps or less. It seems
> that if a FireWire link gets the least bit congested, it totally freaks
> out and all performance hits the floor. This isn't a big surprise given
> the upstream/downstream difference in the other tests. These types of
> operating characteristics are extremely undesirable for a network
> transport protocol.
> This wasn't a serious or rigorous test, but it should provide some
> "back of the envelope" numbers to think about. I encourage others
> to run similar tests using various network profiling tools if you
> wish to get better numbers.
> So call it BS if you want, but if we're talking about moving IP
> packets around, I stand by the statement that one should "Expect
> 100Mb Ethernet to provide better performance than 400Mb FW." I'll
> admit the raw numbers are close, and in the case of a nice smooth
> uni-directional data stream, the FW400 link actually out-performed
> what a 100Mb link could deliver-- but the huge performance
> degradation caused by congestion gives me serious pause for a more
> generalized traffic pattern. Regardless, it definitely isn't
> anything near GigE speeds.
> There are also more practical limits to the use of a FireWire network
> vs Ethernet. For starters, from what I understand of FireWire
> "hubs", they are usually repeater based, and not switch based, at
> least in the terms of a more traditional Ethernet network. So while
> the bandwidth numbers are close for a single point-to-point link, I
> would expect the FireWire numbers to drop off drastically when you
> started to link five or six machines together. There is also the
> issue of port density. You can get 24 port non-blocking GigE
> switches for a few thousand bucks. I'm not even sure if a 24 port
> FireWire hub exists. If you start to link multiple smaller hubs
> together (even with a switch style data isolation) your cluster's
> bi-section bandwidth sucks, and your performance is going to suffer.
> Beyond that, FireWire networks are limited to only 63 devices,
> although I would expect that to not be a serious limitation for
> most clusters.
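The hub-versus-switch point above can be made concrete with a toy model; the link rates and node count below are illustrative only, and real-world overheads are ignored:

```python
# Rough model of per-node bandwidth, illustrating why repeater-style
# FireWire "hubs" scale worse than a non-blocking Ethernet switch.
# Rates are in Mbps; protocol overhead is ignored.

def per_node_bw(link_mbps, nodes, non_blocking):
    """Average bandwidth available to each node."""
    if non_blocking:
        # A non-blocking switch gives every node its full link rate.
        return link_mbps
    # A repeater is one shared arbitration domain: all nodes
    # split the capacity of a single link.
    return link_mbps / nodes

print(per_node_bw(400, 6, non_blocking=False))   # FW400 repeater: ~67 Mbps each
print(per_node_bw(1000, 6, non_blocking=True))   # GigE switch: 1000 Mbps each
```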
> In short, while running something over FireWire is possible, I see
> very little motivation to do so, especially with the low-cost
> availability of high-performance Ethernet interfaces and switches.
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu