[Beowulf] x86-64 NUMA vs SMP kernel: appl. performance?

Fri Sep 24 15:31:25 PDT 2004

On Fri, Sep 24, 2004 at 05:47:40PM -0400, Robert G. Brown wrote:
> On Fri, 24 Sep 2004, Greg Lindahl wrote:
> 
> > On Fri, Sep 24, 2004 at 03:04:05PM -0400, Robert G. Brown wrote:
> > 
> > > What compilers have you tried, and what improvements do they
> > > produce?
> > 
> > Robert,
> > 
> > As you might recall, I do work for a compiler company, so obviously that
> > should be kept in mind. The 3 apps mentioned by the original poster
> 
> Impressive results nonetheless (and besides, I trust your honesty).
> Since your customers are reporting them, I would assume that they are
> just swapping the compilers in and out and not necessarily doing lots of
> compiler specific tuning.

I posted very similar numbers (if not the same) to the NWChem list
and was contacted by someone at PathScale.  The comparison I was doing
involved an NWChem simulation that a user wanted to purchase a cluster
for.  Under consideration were the G5, Itanium 2, and Opteron (Nocona
wasn't out).

As it turned out, IBM xlf + G5 won the price/performance comparison
against pgc + Opteron.  I then got the PathScale 30-day eval, and that
tipped the balance towards Opteron when considering cluster
price/performance.

There were already what looked to me like reasonable compiler flags
for both, so I used them.  Normally I would read the compiler docs,
browse the settings used for specbench runs, and attempt to search the
optimization space for at least the low-hanging fruit.  In this case I
didn't have any way to test correctness, so I figured I'd stick with
the (hopefully) well-tested flags.

I was impressed that the PathScale compiler (a newcomer to the market)
managed to compile a wide range of codes with no problems and produce
impressively fast binaries.  Amusingly, one of the codes that took just
10-20 minutes to compile usually took 3 days on the Itanium 2 + Intel
compiler, and no, it wasn't a particularly fast binary either.

NWChem seems rather bandwidth or latency sensitive, depending on the
size and nature of the simulation.  For that reason I'm waiting a bit to
see how PCI Express and HyperTransport connected interconnects play out.
A few motherboard manufacturers and interconnect companies have been
making very interesting noises as of late.  It seems that for
communications-intensive codes SGI will have some competition from
OctigaBay and similar designs.

I have not yet done a similar comparison against the 3.4 GHz Nocona and
Intel's 8.1 compiler.  I'm not sure whether the 8.1 compiler will
cripple x86-64 running like the previous version did.  Nor have I tested
PathScale 1.3.

On a smaller custom production code I did attempt to search the
compiler optimization spaces and the pathscale advantage was even
higher.
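
A flag-space search of this kind can be sketched roughly as follows.
This is only an illustrative Python sketch, not the script used for the
comparison above: the function names are hypothetical, the flag groups
are placeholders (the examples are ordinary GCC options, not the flags
from these runs), and it only measures run time, so a separate
correctness check on each binary's output is still needed.

```python
import itertools
import subprocess
import time


def flag_combinations(flag_groups):
    """Yield one flag list per combination, taking one option from each group."""
    for combo in itertools.product(*flag_groups):
        # An empty string in a group means "leave this option off".
        yield [flag for flag in combo if flag]


def build_and_time(compiler, flags, source, binary="./a.out"):
    """Compile `source` with `flags`, run the binary once, return seconds."""
    subprocess.run([compiler, *flags, "-o", binary, source], check=True)
    start = time.perf_counter()
    subprocess.run([binary], check=True)
    return time.perf_counter() - start


def search(flag_groups, measure):
    """Return (best_time, best_flags) over every combination.

    `measure` is any callable mapping a flag list to a run time, for
    example a wrapper around build_and_time for one compiler and source.
    """
    results = [(measure(flags), flags) for flags in flag_combinations(flag_groups)]
    return min(results, key=lambda r: r[0])
```

A call might look like
search([["-O2", "-O3"], ["", "-funroll-loops"]],
lambda f: build_and_time("gcc", f, "bench.c")), where bench.c stands in
for whatever benchmark source is being tuned.  An exhaustive product
only works for a handful of options; it grows multiplicatively with each
group added.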

> Are these fortran or C results (or do you know)?  And how much do the
> compilers cost (and how do the costs scale over a cluster)?

My understanding (possibly flawed) is that produced binaries could be
run on the entire cluster, so a fixed or a dynamic license would allow a
single user to compile on the head node.  The licensing, I believe, was
"sticky" for 10 minutes or so.  I believe the pricing is available on
the PathScale website.

-- 
Bill Broadley
Computational Science and Engineering
UC Davis