[Beowulf] [jak@uiuc.edu: Re: [APPL:Xgrid] [Xgrid] Re: megaFlops per Dollar? real world requirements]

Robert G. Brown rgb at phy.duke.edu
Sat May 14 04:32:48 PDT 2005


On Sun, 15 May 2005, Eugen Leitl wrote:

> ----- Forwarded message from "Jay A. Kreibich" <jak at uiuc.edu> -----
> On Thu, May 12, 2005 at 01:45:45PM -0500, Jay A. Kreibich scratched on the wall:

(A loverly discussion of IPoFW)

Hey, Jay, over here in beowulf-land WE appreciate your analysis.  I
found it very useful and entirely believable.  In fact, I've seen very
similar behavior to that which you describe in some low end (RTL 8139)
100 Mbps ethernet adapters -- the bit about network throughput going
absolutely to hell in a handbasket if the network becomes congested
enough to saturate the cards.  In the case of the RTL, I recall it was a
buffering issue (or rather, the lack thereof) so overruns were dropped
and a full-out TCP stream could drop to 2 Mbps on a particularly bad
run.

We also understand the difference between "latency" (which usually
dominates small packet/message transfer) and "bandwidth" (which is
usually wirespeed less mandatory overhead for some optimal pattern of
data transfer, e.g. large messages in TCP/IP).  In fact, your whole
article was very, very cogent.

Thanks!

   rgb

> 
> >   IPoFW performance is very very low.  Expect 100Mb Ethernet (yes,
> >   that's "one hundred") to provide better performance than 400Mb FW.
> >   There was a big discussion about this many months ago that led to
> >   Apple removing any references to IPoFW from their Xserve and cluster
> >   web pages.  The utilization difference is that big.
> 
>   Since it appears that there are members on this list who disagree
>   with me and would rather cuss at me in private than have an
>   intelligent, rational discussion with the whole group, and since they
>   chose harsh language over running a few simple bandwidth tests,
>   I ran the tests myself (numbers below) and will direct a few comments
>   at the group as a whole.  Maybe others can contribute some meaningful
>   comments.  If you disagree with me, at least do it in public.
> 
> >   While the raw bandwidth numbers for FireWire are higher, the FireWire
> >   MAC layer is designed around block transfers from a disk, tape, or
> >   similar device. 
> 
>   First off, let's be sure we're all on the same page.  The original
>   question was about the use of Xgrid over FireWire based networks.
>   Since Xgrid runs on top of BEEP over TCP/IP, the question really
>   boils down to one of performance of IP over FireWire-- i.e., IPoFW.
>   It is important to understand that this is not an encapsulation of
>   an Ethernet stream on the FireWire link, or some other more traditional
>   networking technology, but actually running FireWire as the Layer-2
>   transport for IP.  RFC-2734 explains how this is done.
> 
>   <http://www.rfc-editor.org/rfc/rfc2734.txt>
> 
>   The problem with IPoFW is that FireWire is designed as an
>   infrastructure interconnect, not a networking system.  It has a lot
>   more in common with systems like SCSI, HiPPI, and Fibre Channel
>   than it does with systems like Ethernet.
> 
>   Since every major networking technology of the last 30 years has been
>   frame/packet or cell based (and even cell is getting more and more
>   rare), it shouldn't be a big shock that most traditional networking
>   protocols (e.g. IP) are designed and tuned with these types of physical
>   transport layers in mind.  While FireWire is much better at large
>   bulk transfers, it is not so hot at moving lots of very small data
>   segments around, such as individual IP packets.
>   
>   In many ways, it is like the difference between a fleet of large
>   trucks and a train of piggy-back flat cars.  Both are capable of
>   transporting the same basic unit of data, but each is designed around
>   a different set of requirements.  Each has its strength and weakness,
>   depending on what you are trying to do.  If you're trying to move
>   data en masse from a disk (or video camera) to a host system, the
>   train model will serve you much better.  The connection setup is
>   expensive, but the per-unit costs are low assuming a great number of
>   units.  If, on the other hand,  you're trying to download data from
>   the web, the truck model is a better deal.  The per-unit costs are a
>   bit higher, but the system remains fairly efficient with lower
>   numbers of units since the connection setup is much less.
> 
>   So if you hook two machines together with a FireWire cable, put
>   one of those machines into "target disk" mode, and start to copy
>   files back and forth, I would expect you to get really good performance.
>   In fact, despite the fact that GigE has over twice the bandwidth of
>   FireWire 400 (GigE = 1000Mbps, FW400 = 400Mbps), I would expect the
>   FireWire to outperform any network based file protocol, like NFS or
>   AFP, running over GigE, in operations such as a copy.  This is exactly
>   the type of operation that FireWire is designed to do, so it is no
>   shock that it does it extremely efficiently.  When used in something
>   like target disk mode, it is also operating at a very low level in
>   the kernel (on the host side), with a great deal of hardware
>   assistance.  NFS or AFP, on the other hand, are layered on top of
>   the entire networking stack (on both the "disk" side and the "host"
>   side) and have to deal with a great number of abstractions.  Also,
>   because of the hardware design (largely having to do with the size
>   of the packets/frames) it is difficult for most hardware to fully
>   utilize a GigE connection, so the full 1000Mb can't be used (at all;
>   this limit isn't specific to file protocols).  So it isn't a big
>   shock that a network file protocol doesn't work very efficiently
>   and that the slower transport can do a better job-- it is designed
>   to do a better job, and you aren't using the technologies in the
>   same way.  A more valid comparison might be between FW and iSCSI
>   over Ethernet so that the two transport technologies are at least
>   working at the same level (and even then, I would still expect FW
>   to win, although not by as much).
> 
>   This is, however, a two-way street.  If we return to the question of
>   IPoFW, where you are moving IP packets rather than disk blocks, it
>   should be no shock that a transport technology specifically designed
>   to move network packets can outperform one that was designed around
>   block copies.  Ethernet is a very light-weight protocol (which is
>   both good and bad, like the trucks) and deals with frame based
>   network data extremely well.  Even if we assume that FireWire can run
>   with a high efficiency, it would be normal to expect GigE to
>   outperform it, just because it has 2.5x the bandwidth.  But because
>   you're asking FireWire to do something it isn't all that great at,
>   the numbers are much worse.
> 
>   So here's what I did.  I hooked my TiBook to my dual 1.25
>   QuickSilver.  On each I created a new Network Location with just the
>   FireWire physical interface, and assigned each one an address on the
>   10 net.  There were no other active interfaces.  I then ran a
>   series of 60 second tests using "iperf" from NLANR, forcing a
>   bi-directional data stream over the IPoFW-400 link.  I used the TCP
>   tests, because this is the only way to have the system measure
>   bandwidth directly.  This adds overhead to the transaction and
>   reduces the results (which are indicated as payload bytes only), but
>   since I ran the test the same way in all cases, that shouldn't make a
>   huge difference.
> 
>   Anyways, with the bi-directional test, I was able to get roughly
>   90Mbps (yes, "ninety megabits per second") upstream, and 30Mbps
>   downstream using the IPoFW-400 link.  It seems there were a lot of
>   contention issues when data was pushed both ways at the same time,
>   and one side seemed to always gain the upper hand.  That's not a
>   very good thing for a network to do, and points to self-generated
>   congestion issues.
> 
>   If I only pushed data in one direction, I could get it up to about
>   125Mbps.  I'll grant you that's better than 100baseTX, but I'm not
>   sure I consider half-duplex speeds all that interesting.  As was
>   clear from the other test, when you add data going the other way,
>   performance drops considerably.
> 
>   Just to be sure I was doing the test correctly, I ran the same tests
>   with a point-to-point Ethernet cable between the machines.  Both
>   machines have GigE, so it ran nicely around 230Mbps in both
>   directions.  That may sound a bit low, but the TiBook is an older
>   machine and the processor isn't super fast.  In fact, running 460Mbps
>   of data through the TCP stack isn't too bad for an 800MHz machine
>   (that's one payload byte per 14 CPU cycles, which is pretty darn good!)
>   that isn't running jumbo frames.
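
As a rough check of the "one payload byte per 14 CPU cycles" figure,
the arithmetic can be sketched as below.  The assumptions here are
not stated in the post: "M" is taken as 10**6 for both Mbps and MHz,
and the 230Mbps in each direction is summed to 460Mbps of TCP payload.
The function name is purely illustrative.

```python
# Sanity check of the cycles-per-payload-byte figure quoted above.
# Assumptions (not from the original post): M = 10**6, and the two
# 230 Mbps streams are added together to give 460 Mbps of payload.

def cycles_per_byte(clock_hz: float, payload_bits_per_s: float) -> float:
    """CPU cycles available for each payload byte moved."""
    bytes_per_s = payload_bits_per_s / 8
    return clock_hz / bytes_per_s

total_payload_bps = 2 * 230e6   # 230 Mbps each way
tibook_clock_hz = 800e6         # 800 MHz TiBook

cpb = cycles_per_byte(tibook_clock_hz, total_payload_bps)
print(f"{cpb:.1f} cycles per payload byte")
```

Under those assumptions the result rounds to the 14 cycles quoted.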
>   
>   Speed aside, it is also important to point out that the up-stream and
>   down-stream numbers were EXACTLY the same.  The network seemed to
>   have no contention issues, and both sides were able to run at the
>   maximum speed the end-hosts could sustain.
> 
>   Just for kicks, I manually set both sides to 100Mb/full-duplex and
>   ran the test.  The numbers worked out to about 92Mbps, both ways.
>   A bit lower than you might expect, but given the known overhead of
>   TCP it isn't too bad.  Again, both sides were able to sustain the
>   same rates.  It is also worth noting that the CPU loads on the systems
>   seemed to be considerably less for this test than the FireWire test,
>   even though the amount of data being moved was slightly higher.
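
To put the 92Mbps figure in context, here is a back-of-the-envelope
estimate of the best-case TCP payload rate on 100Mb Ethernet.  It is
only a sketch: it assumes a standard 1500-byte MTU, 20-byte IPv4 and
TCP headers, and optionally the 12-byte TCP timestamp option, none of
which the post specifies.  It also ignores ACK traffic and any host
limits.

```python
# Best-case TCP goodput over Ethernet, per direction.
# Assumptions (not from the original post): 1500-byte MTU, IPv4 and
# TCP headers of 20 bytes each, optional 12 bytes of TCP timestamps.

MTU = 1500
IP_HEADER = 20
TCP_HEADER = 20
# Per-frame cost on the wire beyond the MTU:
# preamble + SFD (8) + Ethernet header (14) + FCS (4) + inter-frame gap (12)
WIRE_OVERHEAD = 8 + 14 + 4 + 12

def max_goodput_mbps(link_mbps: float, tcp_option_bytes: int = 0) -> float:
    """Upper bound on TCP payload throughput for full-size segments."""
    payload = MTU - IP_HEADER - TCP_HEADER - tcp_option_bytes
    on_wire = MTU + WIRE_OVERHEAD
    return link_mbps * payload / on_wire

plain = max_goodput_mbps(100)
with_timestamps = max_goodput_mbps(100, tcp_option_bytes=12)
print(f"no TCP options:  {plain:.1f} Mbps")
print(f"with timestamps: {with_timestamps:.1f} Mbps")
```

Both bounds land in the mid-90s, so a measured 92Mbps in each
direction is within a few percent of what the framing allows.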
> 
>   I also ran a few UDP tests.  In this case, you force iperf to
>   transmit at a specific rate.  If the system or network is unable to
>   keep up, packets are simply dropped.  In a uni-directional test the
>   IPoFW-400 link could absorb 130 Mbps well enough, and was able to
>   provide that kind of data rate.  When pushed to 200Mbps, the actual
>   transmitted data dropped to an astounding *20*Mbps or less.  It seems
>   that if a FireWire link gets the least bit congested, it totally freaks
>   out and all performance hits the floor.  This isn't a big surprise given
>   the upstream/downstream difference in the other tests.  These types of
>   operating characteristics are extremely undesirable for a network
>   transport protocol.
> 
>   This wasn't a serious or rigorous test, but it should provide some
>   "back of the envelope" numbers to think about.  I encourage others
>   to run similar tests using various network profiling tools if you
>   wish to get better numbers.
> 
>   So call it BS if you want, but if we're talking about moving IP
>   packets around, I stand by the statement that one should "Expect
>   100Mb Ethernet to provide better performance than 400Mb FW."  I'll
>   admit the raw numbers are close, and in the case of a nice smooth 
>   uni-directional data stream, the FW400 link actually out-performed
>   what a 100Mb link could deliver-- but the huge performance
>   degradation caused by congestion gives me serious pause for a more
>   generalized traffic pattern.  Regardless, it definitely isn't
>   anything near GigE speeds.
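
One way to line the quoted measurements up side by side is to express
each as a fraction of the link's nominal rate.  The sketch below uses
only the numbers reported in the post; the labels and the choice to
divide by the nominal signalling rate are illustrative assumptions.
The post gives the congested UDP result as "20Mbps or less", so that
row is an upper bound, and it attributes the GigE figure to the hosts
rather than the link.

```python
# Measured throughput as a fraction of nominal link rate.
# All figures are the ones quoted in the post; labels are illustrative.

measurements = [
    # (label, nominal Mbps, measured Mbps)
    ("FW400 TCP, one direction",        400, 125),
    ("FW400 TCP, bidirectional, up",    400,  90),
    ("FW400 TCP, bidirectional, down",  400,  30),
    ("FW400 UDP, 200 Mbps offered",     400,  20),  # "20 or less"
    ("100Mb Ethernet TCP, each way",    100,  92),
    ("GigE TCP, each way",             1000, 230),  # host-limited
]

def utilization(nominal_mbps: float, measured_mbps: float) -> float:
    """Fraction of the nominal link rate actually delivered."""
    return measured_mbps / nominal_mbps

for label, nominal, measured in measurements:
    pct = 100 * utilization(nominal, measured)
    print(f"{label:32s} {measured:4d} of {nominal:4d} Mbps  ({pct:4.1f}%)")
```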
> 
> 
>   There are also more practical limits to the use of a FireWire network
>   vs Ethernet.  For starters, from what I understand of FireWire
>   "hubs", they are usually repeater based, and not switch based, at
>   least in terms of a more traditional Ethernet network.  So while
>   the bandwidth numbers are close for a single point-to-point link, I
>   would expect the FireWire numbers to drop off drastically when you
>   started to link five or six machines together.  There is also the
>   issue of port density.  You can get 24 port non-blocking GigE
>   switches for a few thousand bucks.  I'm not even sure if a 24 port
>   FireWire hub exists.  If you start to link multiple smaller hubs
>   together (even with a switch style data isolation) your cluster's
>   bi-section bandwidth sucks, and your performance is going to suffer.
>   Beyond that, FireWire networks are limited to only 63 devices,
>   although I would expect that to not be a serious limitation for
>   most clusters.
> 
>   In short, while running something over FireWire is possible, I see
>   very little motivation to do so, especially with the low-cost
>   availability of high-performance Ethernet interfaces and switches.
> 
>    -j
> 
> 

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu




