PCI-64: how to find

Donald Becker becker at scyld.com
Wed Feb 27 08:13:52 PST 2002


On Wed, 27 Feb 2002, Patrick Geoffray wrote:
> Velocet wrote:
> > Right. I don't know if using LAM/MPI ends up generating the same sized
> > packets, but for a number of the jobs I've run with gromacs under it
> > they all seem to be around 700-1000 bytes. Is that big enough to get a win
> > out of 64 bit busses?
> 
> With IP, the interconnect part of the latency is very small compared to
> the IP stack overhead. For message sizes <1KB, you should not see a big
> difference with IP on 64 bit busses.
> 
> > The cost differences are almost 2x the price for super cheap cards. If you
> > go with quality (Intel, higher end D-Links) it's not as big a difference.
> > (i.e. ARKs being $38 for 32bit and $66 for 64bit for low end stuff).
> > 
> > How much latency is due to the stack then? Can this be improved?
> 
> For a 100 us latency, for example, I would say 80% is in the software and
> 20% in the hardware. Yes, it could be improved if the IP stack in Linux
> went zero-copy. Linux folks are talking about it, but it won't be in
> stable anytime soon.
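
For scale, here is a rough sketch of what bus width alone can buy for a
message of about 1000 bytes. It assumes a 33.33 MHz PCI clock and an ideal
burst with one data phase per cycle; arbitration, address phases and wait
states are ignored, so real transfers are slower. The function name is made
up for illustration, and the 100 us figure is just the example used above.

```python
import math

PCI_CLOCK_MHZ = 33.33  # assumption: standard 33 MHz PCI clock


def pci_burst_time_us(nbytes, bus_bits, clock_mhz=PCI_CLOCK_MHZ):
    """Ideal time in microseconds to burst nbytes across the bus."""
    bytes_per_cycle = bus_bits // 8
    cycles = math.ceil(nbytes / bytes_per_cycle)
    # clock_mhz is cycles per microsecond, so this yields microseconds.
    return cycles / clock_mhz


t32 = pci_burst_time_us(1000, 32)
t64 = pci_burst_time_us(1000, 64)

total_latency_us = 100.0  # example figure from this thread
saving_us = t32 - t64     # best case gain from the wider bus

print(f"32 bit: {t32:.2f} us, 64 bit: {t64:.2f} us")
print(f"saving: {saving_us:.2f} us "
      f"({100.0 * saving_us / total_latency_us:.1f}% of {total_latency_us:.0f} us)")
```

Under these assumptions the wider bus saves a little under 4 us per
message, which is a few percent of a 100 us latency.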

Small message latency will not be improved with zero copy.  The
added overhead of wiring down pages and doing translations overwhelms
the small cost of copying <1KB messages that are almost always still in
L1 cache.

The current "zero-copy" in the 2.4 kernel is for a very specific case:
large transmits where there are whole-page buffers of data from a file.
This is pretty much the only case where "zero copy" makes sense, and it
is only a win with hardware that generates TCP and IP checksums during
the transmit processing.
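
A sketch of that file-transmit case as seen from user space, assuming it
is reached through the sendfile() system call. Python's socket.sendfile()
uses os.sendfile() where available and falls back to plain send()
otherwise; whether the kernel and NIC actually avoid the copy is not
something this code can control or verify. The name send_whole_file is
made up for illustration.

```python
import socket


def send_whole_file(sock: socket.socket, path: str) -> int:
    """Send the file at path over a connected, blocking stream socket.

    Returns the number of bytes sent. The file is opened in binary mode,
    which socket.sendfile() requires.
    """
    with open(path, "rb") as f:
        return sock.sendfile(f)
```

Nothing here helps a sub-1KB MPI message; it only applies when the data
already sits in whole pages backed by a file.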

Since we are talking about latency, note that hardware TCP/UDP/IP
checksum will result in increased transmit latency.  The NIC has to read
the whole packet before it can insert the checksums into the header and
start transmission.  The standard method of having the main CPU compute
the Tx checksum from usually-cached packet data and using a dynamic
Tx threshold allows the NIC to start transmission sooner.
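
For reference, the checksum in question is the 16-bit ones' complement
sum described in RFC 1071. Below is a minimal, unoptimized sketch of what
the host CPU computes in the software case. The function name is
illustrative, and real stacks use tuned code rather than a byte loop like
this one. The test vector is the worked example from RFC 1071.

```python
def internet_checksum(data: bytes) -> int:
    """Return the 16-bit ones' complement checksum of data (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte

    total = 0
    for i in range(0, len(data), 2):
        # Add each big-endian 16-bit word.
        total += (data[i] << 8) | data[i + 1]

    # Fold any carries above bit 15 back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)

    return ~total & 0xFFFF


sample = bytes([0x00, 0x01, 0xF2, 0x03, 0xF4, 0xF5, 0xF6, 0xF7])
checksum = internet_checksum(sample)
print(hex(checksum))  # 0x220d

# A receiver summing the data together with the checksum gets zero.
verify = internet_checksum(sample + checksum.to_bytes(2, "big"))
```

The sum covers every byte of the packet while the result is stored in the
header at the front, which is why a NIC doing this work cannot start
sending until it has read the whole packet.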

> Another way to avoid the IP overhead is to use an OS-bypass mechanism for
> Gigabit Ethernet, like GAMMA or MVIA. However, these layers are
> NIC-specific. That's the cost of having tons of hardware suppliers: they
> drive the cost down but diversity is a nightmare for software talking
> directly to the hardware.

IMHO, GAMMA and MVIA illustrate the difficulty of implementing general
OS-bypass.  If anything, general purpose systems are moving away from
the possibility of OS bypass -- viz. the dramatically increased
processing needed just for the "IP chains" hooks in the protocol stack.

Donald Becker				becker at scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993



