[Beowulf] Varying performance across identical cluster nodes.
pbisbal at pppl.gov
Thu Sep 14 11:45:21 PDT 2017
I'm happy to announce that I finally found the cause of this problem:
numad. On these particular systems, numad was having a catastrophic
effect on performance. As the jobs ran, GFLOPS would decrease
steadily. Watching the output of turbostat and 'cpupower monitor', I
could see more and more cores becoming idle as the job ran. As soon as
I turned off numad and restarted my LINPACK jobs, the performance went
back up, and this time it stayed there for the duration of the job.
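For anyone who wants to check their own nodes for the same pattern, here
is a minimal Python sketch that tests whether a series of GFLOPS readings
from one run declines steadily. The function names and the sample
readings are hypothetical, made up for illustration; they are not taken
from the runs described above.

```python
def is_steadily_decreasing(samples, tolerance=0.0):
    """Return True if the readings never rise (beyond `tolerance`)
    from one sample to the next and end lower than they started."""
    if len(samples) < 2:
        return False
    never_rises = all(
        later <= earlier + tolerance
        for earlier, later in zip(samples, samples[1:])
    )
    return never_rises and samples[-1] < samples[0]


def fraction_lost(samples):
    """Fraction of the initial performance lost by the last sample."""
    return (samples[0] - samples[-1]) / samples[0]


# Hypothetical GFLOPS readings sampled during a single LINPACK run.
readings = [60.0, 48.5, 37.2, 25.9, 14.1]

if is_steadily_decreasing(readings):
    print("steady decline, lost {:.0%} of initial performance".format(
        fraction_lost(readings)))
else:
    print("no steady decline")
```

The `tolerance` argument is there because real readings are noisy; a
small positive value lets minor upticks through without hiding the trend.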
To make sure I wasn't completely crazy for having numad enabled on these
systems, I did a google search and came across the paper below, which
indicates that in some cases having numad is helpful, and in other
cases, it isn't:
To verify this fix, I ran LINPACK again across all the nodes in this
cluster (well, all the nodes that weren't running user jobs at the
time), in addition to the Supermicro nodes. I found that on the
non-Supermicro nodes, which are Proliant servers with different Opteron
processors, turning numad off actually decreased performance by about 5%.
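Since the effect goes in opposite directions on the two node types, the
on/off decision has to be made per node type. A rough Python sketch of
that comparison follows. The function names, the 1% noise threshold, and
the GFLOPS figures in the usage lines are assumptions for illustration,
not measurements from this cluster.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    if before == 0:
        raise ValueError("baseline must be non-zero")
    return (after - before) / before * 100.0


def recommend_numad(gflops_numad_on, gflops_numad_off, threshold_pct=1.0):
    """Suggest a numad setting from two LINPACK results on one node type.

    `threshold_pct` is an assumed noise floor: smaller differences are
    treated as inconclusive.
    """
    change = percent_change(gflops_numad_on, gflops_numad_off)
    if change > threshold_pct:
        return "off"   # turning numad off helped
    if change < -threshold_pct:
        return "on"    # turning numad off hurt
    return "no clear difference"


# Hypothetical results for two node types (illustrative numbers only).
print(recommend_numad(gflops_numad_on=14.0, gflops_numad_off=64.0))
print(recommend_numad(gflops_numad_on=100.0, gflops_numad_off=95.0))
```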
Have any of you had similar problems with numad? Do you leave it on or
off on your cluster nodes? Feedback is greatly appreciated. I did a
google search of 'Linux numad HPC performance' (or something like that),
and the link above was all I could find on this topic.
For now, I think I'm going to leave numad enabled on the non-Supermicro
nodes until I can do more research/testing.
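With numad on for some nodes and off for others, it helps to be able to
confirm on a given node whether the daemon is actually running. Below is
a small Python sketch that scans /proc for a process by name. It assumes
the Linux /proc layout, where each /proc/<pid>/status file contains a
"Name:" line; the function name `pids_named` is made up for this example.

```python
import os


def pids_named(name, proc_root="/proc"):
    """Return the sorted PIDs whose /proc/<pid>/status has Name: `name`."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        status_path = os.path.join(proc_root, entry, "status")
        try:
            with open(status_path) as status:
                for line in status:
                    if line.startswith("Name:"):
                        if line.split(":", 1)[1].strip() == name:
                            pids.append(int(entry))
                        break
        except OSError:
            continue  # process exited or file is not readable
    return sorted(pids)


if __name__ == "__main__" and os.path.isdir("/proc"):
    found = pids_named("numad")
    if found:
        print("numad is running, PID(s):", found)
    else:
        print("numad is not running")
```

Run across the cluster with whatever parallel shell you already use, this
gives a quick view of which nodes still have numad active.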
On 09/13/2017 01:48 PM, Prentice Bisbal wrote:
> Okay, based on the various responses I've gotten here and on other
> lists, I feel I need to clarify things:
> This problem only occurs when I'm running our NFSroot based version of
> the OS (CentOS 6). When I run the same OS installed on a local disk, I
> do not have this problem, using the same exact server(s). For testing
> purposes, I'm using LINPACK, and running the same executable with the
> same HPL.dat file in both instances.
> Because I'm testing the same hardware using different OSes, this
> (should) rule out the BIOS and faulty hardware as the cause.
> This leads me to believe it's most likely a software configuration
> issue, like a kernel tuning parameter, or some other software
> configuration issue.
> These are Supermicro servers, and it seems they do not provide CPU
> temps. I do see a chassis temp, but not the temps of the individual
> CPUs. While I agree that should be the first thing I look at, it's not
> an option for me. Other tools like FLIR and Infrared thermometers
> aren't really an option for me, either.
> What software configuration, either a kernel parameter,
> configuration of numad or cpuspeed, or some other setting, could
> affect this?
> On 09/08/2017 02:41 PM, Prentice Bisbal wrote:
>> I need your assistance debugging a problem:
>> I have a dozen servers that are all identical hardware: SuperMicro
>> servers with AMD Opteron 6320 processors. Ever since we upgraded to
>> CentOS 6, the users have been complaining of wildly inconsistent
>> performance across these 12 nodes. I ran LINPACK on these nodes, and
>> was able to duplicate the problem, with performance varying from ~14
>> GFLOPS to 64 GFLOPS.
>> I've identified that performance on the slower nodes starts off fine,
>> and then slowly degrades throughout the LINPACK run. For example, on
>> a node with this problem, during the first LINPACK test, I can see the
>> performance drop from 115 GFLOPS down to 11.3 GFLOPS. That constant,
>> downward trend continues throughout the remaining tests. At the start
>> of subsequent tests, performance will jump up to about 9-10 GFLOPS,
>> but then drop to 5-6 GFLOPS at the end of the test.
>> Because of the nature of this problem, I suspect this might be a
>> thermal issue. My guess is that the processor speed is being
>> throttled to prevent overheating on the "bad" nodes.
>> But here's the thing: this wasn't a problem until we upgraded to
>> CentOS 6. Where I work, we use a read-only NFSroot filesystem for our
>> cluster nodes, so all nodes are mounting and using the same exact
>> read-only image of the operating system. This only happens with these
>> SuperMicro nodes, and only with the CentOS 6 on NFSroot. RHEL5 on
>> NFSroot worked fine, and when I installed CentOS 6 on a local disk,
>> the nodes worked fine.
>> Any ideas where to look or what to tweak to fix this? Any idea why
>> this is only occurring with RHEL 6 w/ NFS root OS?