[Beowulf] Multirail Clusters: need comments

Mark Hahn hahn at physics.mcmaster.ca
Mon Dec 5 06:41:51 PST 2005


> > so you're talking about 6 GB/s over quad-rail.  it's hard to 
> > imagine what you would do with that kind of bandwidth
> 
> This is a very bold assertion...

well, 6 GB/s does seem like a lot, even for a 16-core machine that's 
barely practical today.  it's also a fat enough node to make you really
want topology-aware MPI.

> > only you can answer this.  my experience is that very few applications
> > need anything like that much bandwidth - I don't think anyone ever saturated 
> > 1x quadrics on our alpha/elan3 clusters (~300 MB/s), and even on
> > opteron+myri-d, latency seems like more of an issue.
> 
> Your alpha/elan3 clusters would have been quad CPU machines.

right, so crudely speaking, we can characterize it as roughly 50 seconds
(4G ram/cpu * 4 cpus / 300 MB/s) and 22 flops/byte (833*2*4 mflops / 300 MB/s).
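
just to make that arithmetic explicit, here's a rough python sketch; the
4 GB-per-cpu figure is my assumption to make the ~50 s number come out,
not something stated for those machines:

# back-of-envelope node "balance" numbers from the thread
def balance(ncpu, mhz, flops_per_cycle, ram_gb, bw_mbs):
    """seconds to stream the node's ram through the NIC, and
       flops available per byte of interconnect bandwidth"""
    mflops = ncpu * mhz * flops_per_cycle
    drain_s = ram_gb * 1024.0 / bw_mbs      # GB -> MB, divided by MB/s
    flops_per_byte = mflops / bw_mbs        # MFLOPS / (MB/s) = flops/byte
    return drain_s, flops_per_byte

# quad 833 MHz alpha, elan3 at ~300 MB/s, 4 GB/cpu (assumed)
print(balance(ncpu=4, mhz=833, flops_per_cycle=2, ram_gb=16, bw_mbs=300))
# -> roughly (55 s, 22 flops/byte), close to the figures above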

> The one point you have missed however with multi-rail is that network
> bandwidth is per *node* whereas number of CPU's per node is for the

no, that's obvious.

> large part increasing.  1Gb/s seems like a lot (or at least it did) but
> put it in a 16 CPU machine and all of a sudden you have *less* per CPU
> bandwidth than you had seven years ago in your alpha/elan3.  Couple that

5 years :(

well, today 16x is a bit exotic; I think we can agree that 2x2 is probably
the norm.  so consider a single infinipath link in a 2x2 node (say 2.2 GHz
dual-core opterons, with 16GB).  that leads to about 10 seconds, but
11 flops/byte - a different balance for sure, but how wrong?
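
same sketch, applied to this node; to land on ~10 s and ~11 flops/byte the
infinipath link has to be counted at roughly 1600 MB/s, which is my
assumption about the bandwidth figure being used:

# 2x2 opteron node: 2 sockets * 2 cores, 2.2 GHz, 16 GB ram
ram_mb = 16 * 1024           # node memory in MB
bw_mbs = 1600                # assumed usable infinipath bandwidth, MB/s
mflops = 4 * 2200 * 2        # 4 cores * 2200 MHz * 2 flops/cycle

print(ram_mb / bw_mbs)       # ~10 seconds to stream the node's ram
print(mflops / bw_mbs)       # ~11 flops per byte of interconnect bandwidth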

> with CPU's being n times faster to boot, and all of a sudden multi-rail
> is starting to look less pie-in-the-sky and more like a good idea.

are cpus or nodes getting faster more quickly than interconnects?  dunno -
5 years ago, 4x alphas were a pretty sane choice, mainly for lack of
attractive alternatives.  it's just my perception, but I think there
might actually be _less_ variance now in cores/node, with 2x2 being
the most common and cost-effective configuration.  4-socket doesn't seem
to be getting all that much traction, though no doubt 4-core chips will
change the core/node average in a couple of years.

> appear to hold true, for example given a 16*16 machine average bandwidth
> between two CPUs won't quite double as you double the number of rails
> because 15/256 ranks are local to any given process so will get linear
> bandwidth independent of the number of rails.  This however is simply a
> matter of understanding the topology of the machine.  Another odd case

and indeed of your jobs.  would a bandwidth-intensive program actually
run across all 256 cpus, or would it tend to settle for 16x or 32x runs?
(in the latter case, the number of rails might be completely moot!)
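
a quick illustration of that 16*16 point, with made-up bandwidth numbers
(the 2000 and 900 MB/s figures are just placeholders), showing why doubling
rails doesn't quite double the average pairwise bandwidth:

# 16 nodes * 16 cpus: 15 of each rank's 255 peers are on the same node
nodes, cpus = 16, 16
ranks = nodes * cpus
local_peers  = cpus - 1            # 15 on-node peers
remote_peers = ranks - cpus        # 240 off-node peers

bw_local = 2000.0                  # MB/s, assumed shared-memory bandwidth
bw_rail  = 900.0                   # MB/s, assumed per-rail bandwidth

def avg_bw(rails):
    total = local_peers * bw_local + remote_peers * bw_rail * rails
    return total / (ranks - 1)

print(avg_bw(2) / avg_bw(1))       # ~1.9x, not 2x: local ranks dilute the gain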

in summary, I suspect that a "balance"-based argument (probably flops/byte)
makes sense, but it's not quite clear how fast cpus are outpacing
interconnect bandwidth.  naturally, every application falls at a different
place on this particular metric...

regards, mark hahn.



