[Beowulf] Varying performance across identical cluster nodes.

Thu Sep 14 11:45:21 PDT 2017

Beowulfers,

I'm happy to announce that I finally found the cause of this problem: 
numad. On these particular systems, numad was having a catastrophic 
effect on performance. As the jobs ran, GFLOPS would decrease steadily 
and monotonically. Watching the output of turbostat and 
'cpupower monitor', I could see more and more cores becoming idle as the 
job ran. As soon as I turned off numad and restarted my LINPACK jobs, 
performance went back up, and this time it stayed there for the duration 
of the job.
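
For anyone who wants to check their own nodes for the same symptom, here 
is a minimal sketch that pulls the Gflops column out of standard HPL 
output and flags a run whose results fall from one test to the next. It 
assumes the stock HPL result-line layout (T/V, N, NB, P, Q, Time, Gflops) 
and an HPL.dat that runs several tests; the function names are my own, 
and it only sees the final figure of each test, not the decline inside a 
single test.

```python
import re

# One HPL result line looks like:
# WR00L2L2       29184   192     2     2             109.87              1.508e+02
# Columns: T/V, N, NB, P, Q, Time, Gflops (assumes the stock HPL layout).
HPL_RESULT = re.compile(
    r"^W[RC]\w+\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+([\d.]+)\s+([\d.eE+-]+)\s*$"
)


def parse_gflops(hpl_output):
    """Return the Gflops value of every result line, in run order."""
    results = []
    for line in hpl_output.splitlines():
        match = HPL_RESULT.match(line.strip())
        if match:
            results.append(float(match.group(6)))
    return results


def is_steadily_declining(gflops):
    """True if there are at least two results and each one is lower than the last."""
    return len(gflops) >= 2 and all(
        later < earlier for earlier, later in zip(gflops, gflops[1:])
    )
```

Feeding it the saved output of a run, e.g. 
is_steadily_declining(parse_gflops(open("hpl.out").read())), gives a quick 
yes/no per node ("hpl.out" is a placeholder file name).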

To make sure I wasn't completely crazy for having numad enabled on these 
systems, I did a Google search and came across the paper below, which 
indicates that running numad helps in some cases and hurts in others:

http://iopscience.iop.org/article/10.1088/1742-6596/664/9/092010/pdf

To verify this fix, I ran LINPACK again across all the nodes in this 
cluster (well, all the nodes that weren't running user jobs at the 
time), in addition to the Supermicro nodes. I found that on the 
non-Supermicro nodes, which are ProLiant servers with different Opteron 
processors, turning numad off actually decreased performance by about 5%.

Have any of you had similar problems with numad? Do you leave it on or 
off on your cluster nodes? Feedback is greatly appreciated. I did a 
Google search for 'Linux numad HPC performance' (or something like that), 
and the link above was all I could find on this topic.

For now, I think I'm going to leave numad enabled on the non-Supermicro 
nodes until I can do more research/testing.
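
Since numad will now be on for some nodes and off for others, a quick 
per-node check may be handy. The sketch below is only an illustration: it 
assumes a Linux /proc filesystem and that the daemon's process name is 
"numad", and the find_processes helper is a name I made up.

```python
import os


def find_processes(name, proc_root="/proc"):
    """Return the sorted PIDs whose /proc/<pid>/status Name field equals `name`."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "cpuinfo"
        try:
            with open(os.path.join(proc_root, entry, "status")) as status:
                for line in status:
                    key, _, value = line.partition(":")
                    if key == "Name":
                        if value.strip() == name:
                            pids.append(int(entry))
                        break  # only the Name field is needed
        except (IOError, OSError):
            continue  # the process exited or is not readable
    return sorted(pids)
```

Calling find_processes("numad") on a node returns an empty list when the 
daemon is not running.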

Prentice

On 09/13/2017 01:48 PM, Prentice Bisbal wrote:
> Okay, based on the various responses I've gotten here and on other 
> lists, I feel I need to clarify things:
>
> This problem only occurs when I'm running our NFSroot-based version of 
> the OS (CentOS 6). When I run the same OS installed on a local disk, I 
> do not have this problem, using the exact same server(s). For testing 
> purposes, I'm using LINPACK, and running the same executable with the 
> same HPL.dat file in both instances.
>
> Because I'm testing the same hardware with different OS installations, 
> this should rule out both the BIOS and faulty hardware as the cause. 
> This leads me to believe it's most likely a software configuration 
> issue, such as a kernel tuning parameter.
>
> These are Supermicro servers, and it seems they do not provide CPU 
> temps. I do see a chassis temp, but not the temps of the individual 
> CPUs. While I agree that should be the first thing I look at, it's not 
> an option for me. Other tools like FLIR and Infrared thermometers 
> aren't really an option for me, either.
>
> What software configuration, either a kernel parameter, the 
> configuration of numad or cpuspeed, or some other setting, could 
> affect this?
>
> Prentice
>
> On 09/08/2017 02:41 PM, Prentice Bisbal wrote:
>> Beowulfers,
>>
>> I need your assistance debugging a problem:
>>
>> I have a dozen servers that are all identical hardware: SuperMicro 
>> servers with AMD Opteron 6320 processors. Ever since we upgraded to 
>> CentOS 6, the users have been complaining of wildly inconsistent 
>> performance across these 12 nodes. I ran LINPACK on these nodes, and 
>> was able to duplicate the problem, with performance varying from ~14 
>> GFLOPS to 64 GFLOPS.
>>
>> I've identified that performance on the slower nodes starts off fine, 
>> and then slowly degrades throughout the LINPACK run. For example, on 
>> a node with this problem, during the first LINPACK test, I can see the 
>> performance drop from 115 GFLOPS down to 11.3 GFLOPS. That constant, 
>> downward trend continues throughout the remaining tests. At the start 
>> of subsequent tests, performance will jump up to about 9-10 GFLOPS, 
>> but then drop to 5-6 GFLOPS at the end of the test.
>>
>> Because of the nature of this problem, I suspect this might be a 
>> thermal issue. My guess is that the processor speed is being 
>> throttled to prevent overheating on the "bad" nodes.
>>
>> But here's the thing: this wasn't a problem until we upgraded to 
>> CentOS 6. Where I work, we use a read-only NFSroot filesystem for our 
>> cluster nodes, so all nodes are mounting and using the same exact 
>> read-only image of the operating system. This only happens with these 
>> SuperMicro nodes, and only with the CentOS 6 on NFSroot. RHEL5 on 
>> NFSroot worked fine, and when I installed CentOS 6 on a local disk, 
>> the nodes worked fine.
>>
>> Any ideas where to look or what to tweak to fix this? Any idea why 
>> this is only occurring with RHEL 6 w/ NFS root OS?
>>
>