[Beowulf] New beowulf recommendations

Camm Maguire camm at enhanced.com
Tue Oct 9 14:00:33 PDT 2007

Greetings, and thank you so much for your very helpful replies here!
As you can tell, alas, I have less time for correspondence than I
would like.

Mark Hahn <hahn at mcmaster.ca> writes:

> > 1) There is an onboard Gigabit NIC which pushes the computational load
> >   onto the CPU.
> I doubt it.  it's fairly easy for nics to perform stateless offload,
> and afaik even cheap ones do.  the result is that any nic will
> give nearly the same CPU overhead.  I expect the only "onboard"-ness
> here is that the nic is part of the chipset.  this matters very little,
> since a gigabit nic not going to push the limits of any current bus.

Confirmed!  The onboard NIC has measurably lower latency without
inducing any discernible CPU load.  There are a few oddities as a
function of matrix size, and it does not support jumbo frames
(irrelevant to the latency question), so on balance it looks superior.

We've decided to go with an additional card anyway so that we might
play with GAMMA someday (requires a particular model here).

> >  Our vendor states that a server card in the
> >   PCIExpress slot would have better latency.  True?  Significant?
> peculiar claim, since pcie actually _adds_ a small latency cost;
> they're basing this on offload-type arguments?

This appears to have been a bogus claim, as you suspected.

> > 2) We're considering either a Layer 2 or Layer 3 Netgear 48 port
> >   switch.  The backplane bandwidths are 96Gb and 196Gb respectively,
> >   and the latencies are 20us and 2us.  I don't understand how the
> >   additional bandwidth can be used,
> I'm guessing the l3 switch is GSM7352s, and the l2 is GSM7248.
> while 20 vs 2 us is a big difference, your observed Gb latency
> is still going to be ~50 us, so it's not a huge big deal.

57 us measured across a crossover cable.  Somewhat
counter-intuitively, about 59 us through a reasonably old switch.  (A
very old switch gave 270 us.)  So by some feat of engineering opaque
to me, at least, the switch adds virtually no latency in spite of a
spec-quoted 'latency' >= 20 us.

> if I'm right on the switch models, I think the difference is more
> generational and features.  the 7248 seems like an older-gen switch,
> and lacks not just the L3 stuff but also the 10G options.

Once again, you are right -- the lower quoted L3 switch latency
appears irrelevant to our net node-to-node latency.

> it's the 10G options that let them claim 196 Gbps for the GSM7352s,
> since besides the 48 normal ports, it's got 8x SFP's and bays for 4x
> 10G stacking ports (which btw only adds up to 192 Gbps for me...)
> I'd consider the GSM7352s mainly if I wanted to use the 10G ports
> (you might verify that the ports can be used for 10G in general,
> rather than only for stacking...)

I don't suppose the 10G impacts latency in any discernible way ....

> > but the latency gain seems reason
> >   enough for the Layer 3.  Is it worth an extra $3k?
> I would guess not unless you want the additional features (routing and
> 10G.)
> >  We are network
> >   latency bound on our existing 16 node cluster, but I do not know
> >   how much latency is due to the switch, nor how to find out.
> well, the simplest test is to connect two nodes back-to-back and run a
> latency test.  compare versus plugged into the switch.
> (gigabit ports are all auto-mdi, so you don't need a special crossover
> cable for this test.)
> I would guess that your current switch is about the same latency as
> the GSM7248, but that you'll measure something like 50 us
> back-to-back.  so dropping 18 us will not make a dramatic difference:
> ie, 70 us vs 52 - that's 25%, but it's still nowhere near a "real"
> interconnect (myrinet, infiniband, 10G, quadrics).

25% might be interesting, but we can't find the fat to cut even using
a several-year-old switch (D-Link).
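For anyone wanting to reproduce the back-to-back vs. through-the-switch
comparison, here is a minimal UDP ping-pong sketch (host/port values are
placeholders; shown over loopback for illustration -- run the echo side
on one node and the timing side on the other, then repeat with the
switch in the path):

```python
# Minimal UDP ping-pong latency sketch.  One-way latency is estimated
# as half the average round-trip time for a small packet.
import socket
import threading
import time

HOST, PORT, REPS = "127.0.0.1", 9999, 1000   # placeholders; use a real peer

def echo_server():
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind((HOST, PORT))
    for _ in range(REPS):
        data, addr = s.recvfrom(64)
        s.sendto(data, addr)                  # echo the packet back
    s.close()

t = threading.Thread(target=echo_server)
t.start()
time.sleep(0.1)                               # let the server bind

c = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
payload = b"x" * 8                            # small packet: latency-dominated
start = time.perf_counter()
for _ in range(REPS):
    c.sendto(payload, (HOST, PORT))
    c.recvfrom(64)                            # wait for the echo
elapsed = time.perf_counter() - start
t.join()
c.close()

one_way_us = elapsed / REPS / 2 * 1e6         # half the mean round trip
print(f"one-way latency: {one_way_us:.1f} us")
```

NetPIPE or an MPI ping-pong will give more careful numbers, but this is
enough to see the delta the switch adds.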

> you should also verify that your nics aren't currently doing some sort
> of interrupt mitigation/coalescing, since that will hurt your latency.

Thanks, will look into this.  Modern kernels don't seem to have the io
and irq module options that the older ones did; ethtool -C (coalescing
settings such as rx-usecs) appears to be the current way to control
device interrupt moderation.

> if you are truly small-packet latency-bound, and unwilling to consider
> a higher-performance interconnect, I think you should contemplate putting
> more cores in each box.  going from 2 cores per box to 8 or 16 will make a
> big difference for smallish jobs that use a small number of nodes
> (even if you stick to plain old gigabit).

Alas, we are memory-bandwidth bound in this case; the rate-limiting
step of our algorithm is essentially level-2 BLAS calls.
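For what it's worth, the memory-bound nature of level-2 BLAS follows
from arithmetic intensity: a matrix-vector multiply (dgemv) does ~2n^2
flops while streaming ~n^2 matrix elements once each, whereas dgemm
reuses each element ~n times.  A back-of-the-envelope sketch (the
bandwidth figure below is an illustrative assumption, not a
measurement):

```python
# Arithmetic intensity (flops per byte of memory traffic) for level-2
# vs level-3 BLAS, to show why dgemv hits the memory wall.
n = 4096                        # matrix dimension
bytes_per_word = 8              # double precision

# dgemv: 2n^2 flops, reads ~n^2 matrix words, each used exactly once
gemv_intensity = (2 * n * n) / (n * n * bytes_per_word)

# dgemm: 2n^3 flops on ~3n^2 words; each element is reused ~n times
gemm_intensity = (2 * n ** 3) / (3 * n * n * bytes_per_word)

print(f"dgemv intensity: {gemv_intensity:.2f} flop/byte")
print(f"dgemm intensity: {gemm_intensity:.1f} flop/byte")

# Roofline-style bound under an assumed 6 GB/s memory bandwidth:
bandwidth = 6e9                 # bytes/s -- assumption for illustration
print(f"dgemv flop/s bound: {gemv_intensity * bandwidth:.2e}")
```

So dgemv is capped at roughly bandwidth/4 flop/s no matter how many
cores share the memory bus, which is why more cores per box don't help
us here.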

What is your favorite 48-port Gigabit switch?  Are they all
essentially equivalent?  Serial console management would be nice.

Take care, and thanks so much again!


Camm Maguire			     			camm at enhanced.com
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah
