[Beowulf] Theoretical vs. Actual Performance

Thu Feb 22 09:52:20 PST 2018

On Thu, 22 Feb 2018 09:37:54 -0500 Prentice Bisbal wrote:

> I found literature from AMD stating the
> theoretical performance of these processors is 282 GFLOPS, and my
> LINPACK performance isn't coming close to that (I get approximately 33%
> of that).

That does seem low.  Check the usual culprits:

1.  CPU frequency scaling locked to its lowest setting, or set to a 
governor which adjusts dynamically and then interacts poorly with the 
test software.  The rated performance will almost certainly have been 
measured with the CPU locked to its highest frequency.

2.  Something else running, especially something which forces the test 
program out of memory or out of the file caches.  I wouldn't expect this 
sort of test to be bound by disk IO, but if it is, and hugepages are 
used, enormous performance drops may be observed when the system decides 
to move those around.  I wouldn't put it past AMD or Intel to run these 
sorts of tests with the test system stripped down to the bones:  no 
network, no logging, single user, etc.  That is, absolutely nothing that 
would compete for CPU time.  (I just checked on one of our big systems.  
ps -ef | wc shows 953 processes:  48 migration, 48 ksoftirqd, 49 
stopper, 49 watchdog, 49 kintegrityd, 49 kblockd, 49 ata_sff, 49 md, 49 
md_misc, 49 aio, 49 crypto, 49 kthrotld, 49 rpciod, 19 gdm (console 
processes, even with no display attached at the moment and nobody logged 
in there), 193 events, 12 of my processes, and 107 miscellaneous OS 
processes.)

3.  ulimit settings, including anything set in 
/etc/security/limits.conf.

4.  NUMA issues.  Multithreaded programs have been observed which 
allocate a large block of memory once, which ends up on one side of a 
NUMA system, and then start some or all of their threads on the other.  
Threads on the wrong side will run a variable amount slower than those 
on the right side.  If this is what is going on, locking all threads to 
the same side of the system (if it has just two sides) can speed things 
up a bit, assuming the test isn't supposed to use all of the hardware 
threads.

5.  Different compiler/optimization.  The vendor may have used a binary 
which was tweaked to the Nth degree, perhaps even using profiling from 
earlier runs to optimize the final run.  If you are using a benchmark 
number from AMD, see if you can obtain the exact version of the test 
software that they used (it may be available), so that you can 
eliminate this variable.  Perhaps wherever they keep that, they also 
have a detailed description of the test system?
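
Items 1, 3 and 4 can be checked quickly from a script before starting a 
run.  What follows is only a rough sketch, assuming a Linux node that 
exposes the usual cpufreq and NUMA entries under /sys; the function 
names and the choice of which limits to print are my own, not anything 
that ships with LINPACK or comes from AMD.

```python
#!/usr/bin/env python3
"""Pre-benchmark sanity checks for a Linux node (illustrative sketch)."""
import glob
import os
import resource
from collections import Counter

SYSFS_CPU = "/sys/devices/system/cpu"    # standard Linux sysfs locations
SYSFS_NODE = "/sys/devices/system/node"


def read_text(path):
    """Return the stripped contents of a file, or None if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def governor_counts(cpu_root=SYSFS_CPU):
    """Item 1: count how many CPUs are using each cpufreq governor."""
    pattern = os.path.join(cpu_root, "cpu[0-9]*", "cpufreq", "scaling_governor")
    values = (read_text(p) for p in glob.glob(pattern))
    return Counter(v for v in values if v)


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-3,8' into a sorted list."""
    cpus = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return sorted(cpus)


def numa_cpus(node_root=SYSFS_NODE):
    """Item 4: map each NUMA node number to the CPU ids it contains."""
    nodes = {}
    for path in glob.glob(os.path.join(node_root, "node[0-9]*", "cpulist")):
        node = int(os.path.basename(os.path.dirname(path))[len("node"):])
        text = read_text(path)
        if text is not None:
            nodes[node] = parse_cpulist(text)
    return nodes


def limits():
    """Item 3: (soft, hard) values for a few limits, picked as examples."""
    names = ("RLIMIT_MEMLOCK", "RLIMIT_STACK", "RLIMIT_AS", "RLIMIT_NPROC")
    return {n: resource.getrlimit(getattr(resource, n))
            for n in names if hasattr(resource, n)}


if __name__ == "__main__":
    print("governors:", dict(governor_counts()))
    for name, (soft, hard) in limits().items():
        # resource.RLIM_INFINITY is reported when a limit is unlimited
        print(f"{name}: soft={soft} hard={hard}")
    for node, cpus in sorted(numa_cpus().items()):
        print(f"node {node}: {len(cpus)} CPUs")
```

If the governor summary shows anything other than a fixed high-frequency 
setting on every core, item 1 is worth chasing first.  To try the 
one-sided experiment from item 4, something like 
os.sched_setaffinity(0, numa_cpus()[0]) restricts the calling process to 
node 0's CPUs; note that this only pins CPUs and does not by itself 
control where memory is allocated.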

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech