[Beowulf] performance tweaks and optimum memory configs for a Nehalem

Tom Elken tom.elken at qlogic.com
Mon Aug 10 14:07:23 PDT 2009

> Well, as there are only 8 "real" cores, running a computationally
> intensive process across 16 should *definitely* do worse than across 8.

Not typically.

At the SPEC website there are quite a few SPEC MPI2007 results on Nehalem (the SPEC MPI2007 metric is an average across 13 HPC applications).

IBM, SGI and Platform have some comparisons on clusters with "SMT On" of running 1 rank per core versus 2 ranks per core.  In general, at low core counts (up to about 32), there is roughly an 8% advantage for running 2 ranks per core.  At larger core counts, IBM published a pair of results on 64 cores where the 64-rank performance was equal to the 128-rank performance.  Not all of these applications scale linearly, so on some of them you lose efficiency at 128 ranks compared to 64 ranks.
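Before choosing between 1 and 2 ranks per core, you need to know how many physical cores you actually have versus logical CPUs.  A minimal Linux-only sketch (it parses /proc/cpuinfo; the field names assumed here are the usual Linux ones, and this is not a robust tool):

```python
def count_cpus(cpuinfo_text):
    """Return (physical_cores, logical_cpus) from /proc/cpuinfo contents.

    With SMT on, two logical processors share the same
    (physical id, core id) pair, so the set of pairs counts real cores.
    """
    cores = set()
    logical = 0
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            if ":" in line:
                key, _, val = line.partition(":")
                fields[key.strip()] = val.strip()
        if "processor" in fields:
            logical += 1
            cores.add((fields.get("physical id", "0"),
                       fields.get("core id", fields["processor"])))
    return len(cores), logical

# On a real system you would do:
#   physical, logical = count_cpus(open("/proc/cpuinfo").read())
```

With SMT on, logical CPUs will be twice the physical-core count, so "2 ranks per core" just means launching one MPI rank per logical CPU.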

Details: Results from this year are mostly on Nehalem:
http://www.spec.org/mpi2007/results/res2009q3/ (IBM)
http://www.spec.org/mpi2007/results/res2009q2/ (Platform)
http://www.spec.org/mpi2007/results/res2009q1/ (SGI)
  (Intel has results with Turbo mode turned on and off
    in the q2 and q3 results, for a different comparison)

Or you can pick out the Xeon 'X5570' and 'X5560' results from the list of all results:

In the result index, when "Compute Threads Enabled" = 2x "Compute Cores Enabled", you know SMT is turned on.
In those cases, when "MPI Ranks" = "Compute Threads Enabled", the run used 2 ranks per core.


> However, it's not so surprising that you're seeing peak performance
> with
> 2-4 threads.  Nehalem can actually overclock itself when only some of
> the
> cores are busy -- it's called Turbo Mode.  That *could* be what you're
> seeing.
> --
> Joshua Baker-LePain
> QB3 Shared Cluster Sysadmin
