[Beowulf] Varying performance across identical cluster nodes.

Joe Landman joe.landman at gmail.com
Fri Sep 8 20:56:28 PDT 2017


On 09/08/2017 02:41 PM, Prentice Bisbal wrote:
>
> But here's the thing: this wasn't a problem until we upgraded to 
> CentOS 6. Where I work, we use a read-only NFSroot filesystem for our 
> cluster nodes, so all nodes are mounting and using the same exact 
> read-only image of the operating system. This only happens with these 
> SuperMicro nodes, and only with the CentOS 6 on NFSroot. RHEL5 on 
> NFSroot worked fine, and when I installed CentOS 6 on a local disk, 
> the nodes worked fine.
>
> Any ideas where to look or what to tweak to fix this? Any idea why 
> this is only occurring with RHEL 6 w/ NFS root OS?
>

Sounds suspiciously like a network or other driver running hard in a 
tight polling mode, causing a growing number of context switches (CSW) 
and interrupts (ints) over time.  Since these are Opterons (really? 
still in use?), chances are you have a firmware issue on the set of 
slower nodes that has been corrected on the other nodes.  With NFS 
root, if one node is locking a particular file that the other nodes 
want to write to, a node can appear slow while it waits on the IO.

You might try running dstat and saving output into a file from boot 
onwards.  Then run the tests, and see if the int or CSW are being driven 
very high.  Pay attention to the usr/idl and other percentages.
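If dstat isn't installed on the NFSroot image, the same two numbers can 
be pulled from the kernel directly.  What follows is only a rough 
sketch, not a replacement for dstat: it assumes a Linux /proc/stat with 
the usual "intr" and "ctxt" lines (totals since boot), and the function 
names and the 5 second interval are made up for illustration.

```python
import time


def read_counters(stat_text):
    """Return (interrupts, context_switches) totals from /proc/stat text."""
    intr = ctxt = None
    for line in stat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "intr":
            # The first number on the intr line is the total over all IRQs.
            intr = int(fields[1])
        elif fields[0] == "ctxt":
            ctxt = int(fields[1])
    if intr is None or ctxt is None:
        raise ValueError("intr/ctxt lines not found in input")
    return intr, ctxt


def rates(before, after, seconds):
    """Convert two (intr, ctxt) snapshots into per-second rates."""
    return tuple((a - b) / seconds for b, a in zip(before, after))


def sample(interval=5.0, path="/proc/stat"):
    """Take two snapshots `interval` seconds apart and return the rates."""
    with open(path) as f:
        before = read_counters(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = read_counters(f.read())
    return rates(before, after, interval)


def log(samples, interval=5.0):
    """Print `samples` timestamped lines of int/s and csw/s."""
    for _ in range(samples):
        ints, csw = sample(interval)
        stamp = time.strftime("%H:%M:%S")
        print(f"{stamp} int/s={ints:.0f} csw/s={csw:.0f}", flush=True)
```

Calling log(720) from a script started at boot, with output redirected 
to a file, gives about an hour of samples.  Doing that on one fast node 
and one slow node lets you compare the rates side by side.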

You can also grab temperature stats.  Helps if you have IPMI.

     ipmitool sdr

or, to show only the temperature sensors:

     ipmitool sdr | grep Temp
CPU1 Temp        | 35 degrees C      | ok
CPU2 Temp        | 35 degrees C      | ok
System Temp      | 35 degrees C      | ok
Peripheral Temp  | 38 degrees C      | ok
PCH Temp         | 43 degrees C      | ok

If not, use sensors (from lm_sensors):

     sensors
Package id 1:  +35.0°C  (high = +82.0°C, crit = +92.0°C)
Core 0:        +35.0°C  (high = +82.0°C, crit = +92.0°C)
Core 1:        +35.0°C  (high = +82.0°C, crit = +92.0°C)
Core 2:        +33.0°C  (high = +82.0°C, crit = +92.0°C)
Core 3:        +34.0°C  (high = +82.0°C, crit = +92.0°C)
...
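
To compare temperatures between the fast and slow nodes, it can help 
to turn the ipmitool text into numbers.  This is a small sketch that 
assumes the three-column "name | reading | status" layout shown in the 
ipmitool output above; the function names and the 40 C limit in the 
usage note are hypothetical, so pick a threshold that suits your 
hardware.

```python
def parse_sdr_temps(sdr_text):
    """Map sensor name -> (degrees C, status) from `ipmitool sdr` text."""
    temps = {}
    for line in sdr_text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 3:
            continue  # not a "name | reading | status" row
        name, reading, status = parts
        if not reading.endswith("degrees C"):
            continue  # skips fans, voltages and sensors with no reading
        temps[name] = (float(reading.split()[0]), status)
    return temps


def hot_sensors(temps, limit_c):
    """Names of sensors at or above limit_c, or whose status is not ok."""
    return sorted(
        name
        for name, (value, status) in temps.items()
        if value >= limit_c or status != "ok"
    )
```

Feeding it the captured output of "ipmitool sdr" from each node and 
calling hot_sensors(temps, 40) would list only the PCH sensor for the 
sample above.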



-- 
Joe Landman
e: joe.landman at gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman


