Pentium 4 cluster with Myrinet

Don Holmgren djholm at fnal.gov
Sat Jan 26 17:06:11 PST 2002


The current P4 motherboards with RDRAM are built with either Intel's
i850 (P4) or i860 (dual Xeon) chipset.  On the i860-based boards, the
P64H bridge used to implement the 64-bit/66 MHz PCI bus doesn't do a
good job; AFAIK this problem also existed with the i840 chipset.  DMA
performance on this bus is far below what you'd expect.  Using Myrinet's
gm_debug command, which reports the results of some DMA timings taken
during driver initialization, the best I've seen is about 225 MB/s for
bus_read and 315 MB/s for bus_write.  On many dual Xeon motherboards,
the BIOS configures the P64H's Soft_DT_Timer register to a value
appropriate for four 64/66 slots.  With that setting, the rates are
even worse: 145 MB/s for bus_read, 315 MB/s for bus_write.  If your
dual Xeon board has only two 64/66 slots, you can set the Soft_DT_Timer
value appropriately and get the better DMA rate.  In gm_1.5, there's a
#define (GM_INTEL_860) you can set in drivers/linux/gm/gm_arch.c which
will set the Soft_DT_Timer value for you.  Or you can set it via setpci
(look for device 8086:1360, and make sure that offset 0x50 has a value
of 0x04).
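
For example, with the pciutils tools, something like this should do it
(a sketch only: it assumes a single P64H shows up as device 8086:1360,
and "50.b" addresses the byte at config-space offset 0x50):

    # read the current byte at offset 0x50
    setpci -d 8086:1360 50.b
    # set it to 0x04, the value described above
    setpci -d 8086:1360 50.b=04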

On i850-based P4 boards, you'll have only a 32-bit/33 MHz PCI bus.
There's a problem with those slots as well, with DMA rates down in the
90 MB/sec range.

In practice, the i860 PCI problem limits Myrinet bandwidth to roughly
165 MB/sec (measured with gm_allsize) with the correct P64H setting, and
to roughly 125 MB/sec with the wrong setting.  The best Myrinet
performance, at least according to www.myri.com pages, is something
close to 250 MB/sec on PIII motherboards based on one of the ServerWorks
chipsets.

So, yes, some (but not all!) PIII motherboards have better 64/66 PCI
buses than the current crop of Xeon motherboards.  New Xeon motherboards
will be out soon (promised this quarter) based on a new ServerWorks
chipset.  If past performance is a good indicator, these should have
better PCI performance.  However, they use interleaved DDR rather than
RDRAM, so memory bandwidth may suffer.  I don't know about i845-based P4
motherboards; perhaps Greg Lindahl's web page with gm_debug results
includes one of these systems (also DDR, not RDRAM).

Whether or not you'll suffer with the poor 64/66 bus on dual Xeons
depends on your code, of course.  For our codes (lattice QCD), there's
such a huge benefit from the memory bandwidth boost from the combination
of RDRAM and the 400 MHz FSB of P4s/Xeons that we're willing to put up
with the PCI bus woes (I'm in the process of ordering 48 duals to expand
our cluster).  There's much less advantage to RDRAM on a PIII board
because PIIIs have only a 100 (or 133) MHz FSB.  STREAM memory
bandwidth ("Copy") numbers on PIII RDRAM boards are about 800 MB/sec
IIRC; on our P4 and Xeon boards the STREAM number is about 1400 MB/sec,
and better than 2000 MB/sec if one hand-codes with SSE or uses one of
the SSE-capable compilers.
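
To give a concrete idea of what "hand-codes with SSE" means here: much
of the win comes from non-temporal (streaming) stores.  A minimal
sketch (not our actual code; it assumes 16-byte-aligned buffers, a
length that's a multiple of four floats, and an SSE-capable compiler,
e.g. gcc -msse):

    #include <stddef.h>
    #include <xmmintrin.h>  /* SSE intrinsics */

    /* Copy n floats using streaming stores.  _mm_stream_ps bypasses
       the cache, avoiding the read-for-ownership traffic that limits
       an ordinary copy loop.  Assumes 16-byte alignment, n % 4 == 0. */
    void sse_copy(float *dst, const float *src, size_t n)
    {
        size_t i;
        for (i = 0; i < n; i += 4) {
            __m128 v = _mm_load_ps(src + i); /* aligned 16-byte load */
            _mm_stream_ps(dst + i, v);       /* non-temporal store   */
        }
        _mm_sfence();  /* order the streaming stores before returning */
    }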

I have a small cluster of dual Athlons based on the 760MP chipset; these
have 64-bit/33 MHz PCI buses.  gm_debug numbers on these are 240 MB/s
for bus_read and 227 MB/s for bus_write, and gm_allsize has a
large-message asymptote of about 170 MB/sec.  I've not had a chance to
test a 760MPX-based motherboard, but I believe there are good reports
(or at least rumors) of its 64/66 PCI performance.  For our lattice QCD
code, these systems fall far behind our comparable P4 and Xeon systems.
Your mileage may vary, of course, depending on your code; we had to
tweak our codes to take advantage of (ok, defend against) the 64-byte
cache line size on the P4 (effectively 128-byte lines where software
prefetch is concerned).
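
For illustration, the prefetch tweak amounts to issuing one prefetch
per 128 bytes rather than one per 64-byte line.  A sketch (the
function, loop, and prefetch distance are invented for the example,
not taken from our code):

    #include <stddef.h>
    #include <xmmintrin.h>  /* _mm_prefetch */

    #define PFDIST 512  /* prefetch distance in bytes; needs tuning */

    /* Sum an array, prefetching ahead.  On the P4 a software prefetch
       pulls in 128 bytes (32 floats), so one prefetch per 32-float
       chunk is enough.  Prefetching past the end of the array is
       harmless; prefetch instructions don't fault. */
    float sum_with_prefetch(const float *a, size_t n)
    {
        float s = 0.0f;
        size_t i, j;
        for (i = 0; i < n; i += 32) {
            size_t lim = (i + 32 < n) ? i + 32 : n;
            _mm_prefetch((const char *)(a + i) + PFDIST, _MM_HINT_T0);
            for (j = i; j < lim; j++)
                s += a[j];
        }
        return s;
    }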

Don Holmgren
Fermilab



On Fri, 25 Jan 2002, uccatvm wrote:

> Hi all,
> 
> We are in the process of procuring a fairly large computer, and one
> option we are looking at is a Beowulf-type Intel (or AMD) cluster of
> around 50 nodes with Myrinet. For the types of applications we are
> looking at, a single node Pentium 4 with RDRAM performs much better
> than a PIII, probably largely due to the better memory bandwidth.
> One of the vendors tells us that the current generation of P4 or Xeon
> processors are less optimised for I/O, and are therefore less suitable
> for a massively parallel machine, and we are recommended to go for a
> Pentium III cluster instead. Do members of this list know about
> serious issues of this kind with P4s?
> 
> We have also heard horror stories about dual Pentium III machines,
> with up to 40% performance loss if the second CPU is also running a
> calculation. Is this really so bad? I would expect Xeons to be less
> prone to this effect, because the likely bottleneck is the memory
> bandwidth. Is that so? Is there any advantage in using RDRAM or DDRAM
> with a PIII?
> 
> How do Athlons with DDRAM compare (both on the I/O / communication issue
> and general floating point performance)?
> 
> There are different groups involved, but one of the applications we
> would like to run is NWChem, a quantum chemistry program. Computation
> patterns are much like Gaussian, but the program is designed for
> massively parallel computers, so it runs very efficiently in parallel
> (given a fast interconnect). Like Gaussian, it is very memory demanding,
> doing floating point calculations on large arrays, and also uses a fair
> amount of (local) scratch disk I/O.
> 
> I would be grateful for any answers to the questions above.
> 
> See you,
> 
> Tanja
> -- 
>   =====================================================================
>      Tanja van Mourik                                                
>      Royal Society University Research Fellow
>      Chemistry Department 
>      University College London    phone:    +44 (0)20-7679-4663      
>      20 Gordon Street             e-mail:   work: T.vanMourik at ucl.ac.uk 
>      London WC1H 0AJ, UK                    home: tanja at netcomuk.co.uk     
> 
>      http://www.chem.ucl.ac.uk/people/vanmourik/index.html
>   =====================================================================



