PCI-64: how to find
Donald Becker becker at scyld.com
Wed Feb 27 08:13:52 PST 2002
- Previous message: PCI-64: how to find
- Next message: Impressive stream results (nvidia nforce/crush)
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Wed, 27 Feb 2002, Patrick Geoffray wrote:

> Velocet wrote:
> > right. I dont know if using LAM/MPI ends up generating the same sized
> > packets, but for a number of the jobs I've run with gromacs under it
> > they all seem to be around 700-1000 bytes. Is that big enough to get a win
> > out of 64 bit busses?
>
> With IP, the interconnect part of the latency is very small compared to
> the IP stack overhead. For messages size <1KB, you should not see big
> difference with IP with 64 bit busses.
>
> > The cost differences are almost 2x the price for super cheap cards. If you
> > go with quality (intel, higher end Dlinks) its not as big a difference.
> > (ie ARKs being $38 for 32bit and $66 for 64bit for low end stuff).
> >
> > How much latency is due to the stack then? Can this be improved?
>
> On a 100 us latency for example, I would say 80% is in the software and
> 20% in the hardware. Yes, it can be improved if the IP stack in Linux
> would go zero-copy. Linux folks are talking about it, but it won't be in
> stable anytime soon.

Small message latency will not be improved with zero copy. The added
overhead of wiring down pages and doing translations overwhelms the small
cost of copying <1KB messages that are almost always still in L1 cache.

The current "zero-copy" in the 2.4 kernel is for a very specific case:
large transmits where there are whole-page buffers of data from file.
This is pretty much the only case where "zero copy" makes sense, and it
is only a win with hardware that generates TCP and IP checksums during
the transmit processing.

Since we are talking about latency, note that hardware TCP/UDP/IP
checksum will result in increased transmit latency. The NIC has to read
the whole packet before it can insert the checksums into the header and
start transmission. The standard method of having the main CPU compute
the Tx checksum from usually-cached packet data and using a dynamic Tx
threshold allows the NIC to start transmission sooner.
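A rough back-of-the-envelope sketch of the transmit-start argument above. It assumes peak theoretical PCI burst rates (bus width times clock, which real buses do not sustain) and a hypothetical 256-byte Tx threshold. The function and table names are illustrative only and are not taken from any driver.

```python
# Peak theoretical PCI burst rates in bytes per microsecond.
# Assumption: width * clock, ignoring arbitration and burst setup overhead.
PCI_BYTES_PER_US = {
    "32-bit/33MHz": 4 * 33.33,  # about 133 MB/s
    "64-bit/33MHz": 8 * 33.33,  # about 267 MB/s
    "64-bit/66MHz": 8 * 66.66,  # about 533 MB/s
}


def tx_start_delay_us(packet_bytes, bus, tx_threshold=None):
    """Estimate bus time spent before the NIC can begin transmitting.

    tx_threshold=None models hardware checksumming: the whole packet
    must cross the bus before the header can be completed.
    Otherwise only tx_threshold bytes (hypothetical value) are needed.
    """
    if tx_threshold is None:
        needed = packet_bytes
    else:
        needed = min(packet_bytes, tx_threshold)
    return needed / PCI_BYTES_PER_US[bus]


if __name__ == "__main__":
    for bus in PCI_BYTES_PER_US:
        whole = tx_start_delay_us(1000, bus)
        early = tx_start_delay_us(1000, bus, tx_threshold=256)
        print(f"{bus}: whole packet {whole:.1f} us, "
              f"256-byte threshold {early:.1f} us")
```

Under these assumptions a 1000-byte packet needs roughly 7.5 us of bus time on 32-bit/33MHz PCI, which is small next to the 100 us end-to-end figure quoted above. That is consistent with the claim that a wider bus buys little for sub-1KB messages over IP.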
> Another way to avoid the IP overhead is to use an OS-bypass mechanism for
> Gigabit Ethernet, like GAMMA or MVIA. However, these layers are
> NIC-specific. That's the cost of having tons of hardware suppliers: they
> drive the cost down but diversity is a nightmare for software talking
> directly to the hardware.

IMHO, GAMMA and MVIA illustrate the difficulty of implementing general
OS-bypass. If anything, general purpose systems are moving away from
the possibility of OS bypass -- viz. the dramatically increased
processing needed just for the "IP chains" hooks in the protocol stack.

Donald Becker                    becker at scyld.com
Scyld Computing Corporation      http://www.scyld.com
410 Severn Ave. Suite 210        Second Generation Beowulf Clusters
Annapolis MD 21403               410-990-9993
More information about the Beowulf mailing list
