[Beowulf] [jak at uiuc.edu: Re: [APPL:Xgrid] [Xgrid] Re: megaFlops per Dollar? real world requirements]
Robert G. Brown
rgb at phy.duke.edu
Sat May 14 04:32:48 PDT 2005
On Sun, 15 May 2005, Eugen Leitl wrote:
> ----- Forwarded message from "Jay A. Kreibich" <jak at uiuc.edu> -----
> On Thu, May 12, 2005 at 01:45:45PM -0500, Jay A. Kreibich scratched on the wall:
(A loverly discussion of IPoFW)
Hey, Jay, over here in beowulf-land WE appreciate your analysis. I
found it very useful and entirely believable. In fact, I've seen very
similar behavior to that which you describe in some low end (RTL 8139)
100 Mbps ethernet adapters -- the bit about network throughput going
absolutely to hell in a handbasket if the network becomes congested
enough to saturate the cards. In the case of the RTL, I recall it was a
buffering issue (or rather, the lack thereof) so overruns were dropped
and a full-out TCP stream could drop to 2 Mbps on a particularly bad day.
We also understand the difference between "latency" (which usually
dominates small packet/message transfer) and "bandwidth" (which is
usually wirespeed less mandatory overhead for some optimal pattern of
data transfer, e.g. large messages in TCP/IP). In fact, your whole
article was very, very cogent.
> > IPoFW performance is very very low. Expect 100Mb Ethernet (yes,
> > that's "one hundred") to provide better performance than 400Mb FW.
> > There was a big discussion about this many months ago that led to
> > Apple removing any references to IPoFW from their Xserve and cluster
> > web pages. The utilization difference is that big.
> It appears that there are members on this list who disagree
> with me and would rather cuss at me in private than have an
> intelligent, rational discussion with the whole group. Since they
> choose harsh language over running a few simple bandwidth tests,
> I ran those tests myself (numbers below), and will direct a few comments
> at the group as a whole. Maybe others can contribute some meaningful
> comments. If you disagree with me, at least do it in public.
> > While the raw bandwidth numbers for FireWire are higher, the FireWire
> > MAC layer is designed around block transfers from a disk, tape, or
> > similar device.
> First off, let's be sure we're all on the same page. The original
> question was about the use of Xgrid over FireWire based networks.
> Since Xgrid runs on top of BEEP over TCP/IP, the question really
> boils down to one of performance of IP over FireWire-- e.g., IPoFW.
> It is important to understand that this is not an encapsulation of
> an Ethernet stream on the FireWire link, or some other more traditional
> networking technology, but actually running FireWire as the Layer-2
> transport for IP. RFC-2734 explains how this is done.
> The problem with IPoFW is that FireWire is designed as an
> infrastructure interconnect, not a networking system. It has a lot
> more in common with systems like SCSI, HiPPI, and Fibre Channel
> than it does with systems like Ethernet.
> Since every major networking technology of the last 30 years has been
> frame/packet or cell based (and even cell is getting more and more
> rare), it shouldn't be a big shock that most traditional networking
> protocols (e.g. IP) are designed and tuned with these types of physical
> transport layers in mind. While FireWire is much better at large
> bulk transfers, it is not so hot at moving lots of very small data
> segments around, such as individual IP packets.
> In many ways, it is like the difference between a fleet of large
> trucks and a train of piggy-back flat cars. Both are capable of
> transporting the same basic unit of data, but each is designed around
> a different set of requirements. Each has its strength and weakness,
> depending on what you are trying to do. If you're trying to move
> data en masse from a disk (or video camera) to a host system, the
> train model will serve you much better. The connection setup is
> expensive, but the per-unit costs are low assuming a great number of
> units. If, on the other hand, you're trying to download data from
> the web, the truck model is a better deal. The per-unit costs are a
> bit higher, but the system remains fairly efficient with lower
> numbers of units since the connection setup is much less.
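The trade-off in this analogy can be put in numbers. Below is a toy cost model, with all figures invented purely for illustration: a "train" (FireWire-style) link with expensive setup but cheap per-unit cost, versus a "truck" (Ethernet-style) link with cheap setup but a higher per-unit cost.

```python
# Hypothetical cost model illustrating the truck/train analogy:
# total transfer cost = connection setup + units * per-unit cost.
# All numbers are made up for illustration only.

def total_cost(setup, per_unit, units):
    """Total cost of moving `units` data units over a link."""
    return setup + per_unit * units

# "Train" (FireWire-style): expensive setup, cheap per unit.
# "Truck" (Ethernet-style): cheap setup, pricier per unit.
TRAIN = (100.0, 1.0)   # (setup, per-unit)
TRUCK = (5.0, 2.0)

def cheaper(units):
    """Which model wins for a transfer of `units` units?"""
    train = total_cost(*TRAIN, units)
    truck = total_cost(*TRUCK, units)
    return "train" if train < truck else "truck"

# Small transfers favor the truck; bulk transfers favor the train.
print(cheaper(10))    # truck
print(cheaper(1000))  # train
```

With these made-up numbers the crossover is at 95 units; the point is only that each model wins in its own regime, exactly as the paragraph above argues.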
> So if you hook two machines together with a FireWire cable, put
> one of those machines into "target disk" mode, and start to copy
> files back and forth, I would expect you get really good performance.
> In fact, despite the fact that GigE has over twice the bandwidth of
> FireWire 400 (GigE = 1000Mbps, FW400 = 400Mbps), I would expect the
> FireWire to outperform any network based file protocol, like NFS or
> AFP, running over GigE, in operations such as a copy. This is exactly
> the type of operation that FireWire is designed to do, so it is no
> shock that it does it extremely efficiently. When used in something
> like target disk mode, it is also operating at a very low level in
> the kernel (on the host side), with a great deal of hardware
> assistance. NFS or AFP, on the other hand, are layered on top of
> the entire networking stack (on both the "disk" side and the "host"
> side) and have to deal with a great number of abstractions. Also,
> because of the hardware design (largely having to do with the size
> of the packets/frames) it is difficult for most hardware to fully
> utilize a GigE connection, so the full 1000Mb can't be used (at all;
> this limit isn't specific to file protocols). So it isn't a big
> shock that a network file protocol doesn't work very efficiently
> and that the slower transport can do a better job-- it is designed
> to do a better job, and you aren't using the technologies in the
> same way. A more valid comparison might be between FW and iSCSI
> over Ethernet so that the two transport technologies are at least
> working at the same level (and even then, I would still expect FW
> to win, although not by as much).
> This is, however, a two-way street. If we return to the question of
> IPoFW, where you are moving IP packets rather than disk blocks, it
> should be no shock that a transport technology specifically designed
> to move network packets can outperform one that was designed around
> block copies. Ethernet is a very light-weight protocol (which is
> both good and bad, like the trucks) and deals with frame based
> network data extremely well. Even if we assume that FireWire can run
> with a high efficiency, it would be normal to expect GigE to
> outperform it, just because it has 2.5x the bandwidth. But because
> you're asking FireWire to do something it isn't all that great at,
> the numbers are much worse.
> So here's what I did. I hooked my TiBook to my dual 1.25
> QuickSilver. On each I created a new Network Location with just the
> FireWire physical interface, and assigned each one an address on the
> 10 net. There were no other active interfaces. I then ran a
> series of 60 second tests using "iperf" from NLANR, forcing a
> bi-directional data stream over the IPoFW-400 link. I used the TCP
> tests, because this is the only way to have the system do direct
> bandwidth measurements. This adds overhead to the transaction and
> reduces the results (which are indicated as payload bytes only), but
> since I ran the test the same way in all cases, that shouldn't make a
> huge difference.
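For anyone wanting to reproduce this, the tests described were presumably invocations along these lines, assuming classic NLANR iperf (iperf2) and the 10-net addresses mentioned above; the exact flags are my guess, not taken from the post:

```shell
# Receiver (hypothetical address 10.0.0.1 on the FireWire interface):
iperf -s

# Sender: 60-second TCP test; -d runs traffic in both directions
# at once, matching the bi-directional test described above.
iperf -c 10.0.0.1 -t 60 -d

# Uni-directional version: drop the -d flag.
iperf -c 10.0.0.1 -t 60

# UDP variant used later in the post: push a fixed 200 Mbps stream
# and let the server-side report show how much actually arrived.
iperf -s -u                          # receiver
iperf -c 10.0.0.1 -u -b 200M -t 60  # sender
```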
> Anyways, with the bi-directional test, I was able to get roughly
> 90Mbps (yes, "ninety megabits per second") upstream, and 30Mbps
> downstream using the IPoFW-400 link. It seems there were a lot of
> contention issues when data was pushed both ways at the same time,
> and one side seemed to always gain the upper hand. That's not a
> very good thing for a network to do, and points to self-generated
> congestion issues.
> If I only pushed data in one direction, I could get it up to about
> 125Mbps. I'll grant you that's better than 100baseTX, but I'm not
> sure I consider half-duplex speeds all that interesting. As was
> clear from the other test, when you add data going the other way,
> performance drops considerably.
> Just to be sure I was doing the test correctly, I ran the same tests
> with a point-to-point Ethernet cable between the machines. Both
> machines have GigE, so it ran nicely around 230Mbps in both
> directions. That may sound a bit low, but the TiBook is an older
> machine and the processor isn't super fast. In fact, running 460Mbps
> of data through the TCP stack isn't too bad for an 800MHz machine
> (that's one payload byte per 14 CPU cycles, which is pretty darn good!)
> that isn't running jumbo frames.
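That cycles-per-byte figure checks out. A quick back-of-the-envelope calculation with the same numbers:

```python
# Sanity check on the "one payload byte per 14 CPU cycles" figure:
# 230 Mbps each way = 460 Mbps aggregate through an 800 MHz CPU.

cpu_hz = 800e6            # TiBook clock rate
aggregate_bps = 460e6     # 230 Mbps in each direction
payload_bytes_per_sec = aggregate_bps / 8

cycles_per_byte = cpu_hz / payload_bytes_per_sec
print(round(cycles_per_byte))  # 14
```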
> Speed aside, it is also important to point out that the up-stream and
> down-stream numbers were EXACTLY the same. The network seemed to
> have no contention issues, and both sides were able to run at the
> maximum speed the end-hosts could sustain.
> Just for kicks, I manually set both sides to 100Mb/full-duplex and
> ran the test. The numbers worked out to about 92Mbps, both ways.
> A bit lower than you might expect, but given the known overhead of
> TCP it isn't too bad. Again, both sides were able to sustain the
> same rates. It is also worth noting that the CPU loads on the systems
> seemed to be considerably less for this test than the FireWire test,
> even though the amount of data being moved was slightly higher.
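The "known overhead of TCP" here is mostly fixed per-frame framing cost, and the 92 Mbps figure is consistent with it. A minimal sketch of the theoretical ceiling, assuming a standard 1500-byte MTU and no TCP options:

```python
# Theoretical TCP goodput ceiling on 100 Mbps Ethernet with a
# standard 1500-byte MTU (jumbo frames would raise the ceiling).

MTU = 1500
IP_HDR = 20
TCP_HDR = 20                     # base header, no options assumed
ETH_OVERHEAD = 14 + 4 + 8 + 12   # header + FCS + preamble + inter-frame gap

payload = MTU - IP_HDR - TCP_HDR   # 1460 payload bytes per frame
on_wire = MTU + ETH_OVERHEAD       # 1538 bytes actually on the wire

efficiency = payload / on_wire
print(round(100e6 * efficiency / 1e6, 1))  # 94.9 (Mbps ceiling)
```

The observed ~92 Mbps sits a little below this ceiling, which is plausible once reverse-path ACK traffic and TCP options are accounted for.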
> I also ran a few UDP tests. In this case, you force iperf to
> transmit at a specific rate. If the system or network is unable to
> keep up, packets are simply dropped. In a uni-directional test the
> IPoFW-400 link could absorb 130 Mbps well enough, and was able to
> provide that kind of data rate. When pushed to 200Mbps, the actual
> transmitted data dropped to an astounding *20*Mbps or less. It seems
> that if a FireWire link gets the least bit congested, it totally freaks
> out and all performance hits the floor. This isn't a big surprise given
> the upstream/downstream difference in the other tests. These types of
> operating characteristics are extremely undesirable for a network
> transport protocol.
> This wasn't a serious or rigorous test, but it should provide some
> "back of the envelope" numbers to think about. I encourage others
> to run similar tests using various network profiling tools if you
> wish to get better numbers.
> So call it BS if you want, but if we're talking about moving IP
> packets around, I stand by the statement that one should "Expect
> 100Mb Ethernet to provide better performance than 400Mb FW." I'll
> admit the raw numbers are close, and in the case of a nice smooth
> uni-directional data stream, the FW400 link actually out-performed
> what a 100Mb link could deliver-- but the huge performance
> degradation caused by congestion gives me serious pause for a more
> generalized traffic pattern. Regardless, it definitely isn't
> anything near GigE speeds.
> There are also more practical limits to the use of a FireWire network
> vs Ethernet. For starters, from what I understand of FireWire
> "hubs", they are usually repeater based, and not switch based, at
> least in the terms of a more traditional Ethernet network. So while
> the bandwidth numbers are close for a single point-to-point link, I
> would expect the FireWire numbers to drop off drastically when you
> started to link five or six machines together. There is also the
> issue of port density. You can get 24 port non-blocking GigE
> switches for a few thousand bucks. I'm not even sure if a 24 port
> FireWire hub exists. If you start to link multiple smaller hubs
> together (even with a switch style data isolation) your cluster's
> bi-section bandwidth sucks, and your performance is going to suffer.
> Beyond that, FireWire networks are limited to only 63 devices,
> although I would expect that to not be a serious limitation for
> most clusters.
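The hub-versus-switch point above can be made concrete with a toy model; the link rates and node count below are illustrative only, and real-world overheads are ignored:

```python
# Rough model of per-node bandwidth, illustrating why repeater-style
# FireWire "hubs" scale worse than a non-blocking Ethernet switch.
# Rates are in Mbps; protocol overhead is ignored.

def per_node_bw(link_mbps, nodes, non_blocking):
    """Average bandwidth available to each node."""
    if non_blocking:
        # A non-blocking switch gives every node its full link rate.
        return link_mbps
    # A repeater is one shared arbitration domain: all nodes
    # split the capacity of a single link.
    return link_mbps / nodes

print(per_node_bw(400, 6, non_blocking=False))   # FW400 repeater: ~67 Mbps each
print(per_node_bw(1000, 6, non_blocking=True))   # GigE switch: 1000 Mbps each
```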
> In short, while running something over FireWire is possible, I see
> very little motivation to do so, especially with the low-cost
> availability of high-performance Ethernet interfaces and switches.
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu