[Beowulf] Varying performance across identical cluster nodes.

Prentice Bisbal pbisbal at pppl.gov
Thu Sep 14 06:24:16 PDT 2017


Another good question. The systems with the NFSroot OS still have a
local disk, and that local disk has a /var partition where logs are
written. Both systems also send some logs to a remote log server. The
/etc/rsyslog.conf files were almost identical, but I copied the one from
the NFSroot system to the local-OS system to make sure they matched
exactly. This has had no impact on the performance of xhpl.
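
A quick way to confirm that the copies really do match, byte for byte,
is to checksum the file in each image; a minimal Python sketch (the
paths are illustrative, not our actual mount points):

    #!/usr/bin/env python
    # Checksum a config file in each OS image to confirm they are
    # byte-for-byte identical. Paths are illustrative.
    import hashlib

    for path in ('/etc/rsyslog.conf',               # local-OS copy
                 '/mnt/nfsroot/etc/rsyslog.conf'):  # NFSroot image copy
        with open(path, 'rb') as f:
            print('%s  %s' % (hashlib.md5(f.read()).hexdigest(), path))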

Prentice

On 09/13/2017 02:16 PM, Scott Atchley wrote:
> Are you logging something that goes to disk in the local case, but 
> that competes for network bandwidth when the root is NFS-mounted?
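
One way to test that hypothesis is to sample the kernel's disk and
network counters once a second while xhpl runs and see whether either
one tracks the slowdown. A minimal Python sketch, assuming Linux's
/proc interfaces; the device names are examples:

    #!/usr/bin/env python
    # Sample disk sectors written and network bytes transmitted per
    # second. Device names ('sda', 'eth0') are examples; adjust them
    # for the node being tested.
    import time

    def disk_sectors_written(dev='sda'):
        with open('/proc/diskstats') as f:
            for line in f:
                fields = line.split()
                if fields[2] == dev:
                    return int(fields[9])   # field 10: sectors written
        return 0

    def net_tx_bytes(dev='eth0'):
        with open('/proc/net/dev') as f:
            for line in f:
                if line.strip().startswith(dev + ':'):
                    return int(line.split(':')[1].split()[8])  # TX bytes
        return 0

    prev_d, prev_n = disk_sectors_written(), net_tx_bytes()
    while True:
        time.sleep(1)
        d, n = disk_sectors_written(), net_tx_bytes()
        print('disk +%d sectors/s   net +%d bytes/s' % (d - prev_d,
                                                        n - prev_n))
        prev_d, prev_n = d, n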
>
> On Wed, Sep 13, 2017 at 2:15 PM, Scott Atchley 
> <e.scott.atchley at gmail.com> wrote:
>
>     Are you swapping?
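
Swapping is easy to rule out by watching the swap-in/swap-out counters
during a run; a minimal Python sketch using /proc/vmstat:

    #!/usr/bin/env python
    # Print per-second swap-in/swap-out page deltas; sustained nonzero
    # values during an xhpl run would point at memory pressure.
    import time

    def swap_counters():
        counts = {}
        with open('/proc/vmstat') as f:
            for line in f:
                key, val = line.split()
                if key in ('pswpin', 'pswpout'):
                    counts[key] = int(val)
        return counts

    prev = swap_counters()
    while True:
        time.sleep(1)
        cur = swap_counters()
        print('pswpin +%d  pswpout +%d' %
              (cur['pswpin'] - prev['pswpin'],
               cur['pswpout'] - prev['pswpout']))
        prev = cur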
>
>     On Wed, Sep 13, 2017 at 2:14 PM, Andrew Latham 
>     <lathama at gmail.com> wrote:
>
>         ack, so maybe validate that you can reproduce it with another
>         NFS root, perhaps a lab setup where a single server serves the
>         NFS root to one node. If you could reproduce it that way, it
>         would give some direction. Beyond that, it sounds like an
>         interesting problem.
>
>         On Wed, Sep 13, 2017 at 12:48 PM, Prentice Bisbal 
>         <pbisbal at pppl.gov> wrote:
>
>             Okay, based on the various responses I've gotten here and
>             on other lists, I feel I need to clarify things:
>
>             This problem only occurs when I'm running our NFSroot-based
>             version of the OS (CentOS 6). When I run the same OS
>             installed on a local disk, on the exact same server(s), I
>             do not have this problem. For testing purposes, I'm using
>             LINPACK, running the same executable with the same HPL.dat
>             file in both instances.
>
>             Because I'm testing the same hardware with different OS
>             installations, this should rule out the BIOS and faulty
>             hardware as causes. That leads me to believe it's most
>             likely a software configuration issue, such as a kernel
>             tuning parameter or some other system setting.
>
>             These are Supermicro servers, and it seems they do not
>             report CPU temperatures. I can see a chassis temperature,
>             but not the temperatures of the individual CPUs. While I
>             agree that should be the first thing to look at, it's not
>             an option for me. Other tools like FLIR cameras and
>             infrared thermometers aren't really an option, either.
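
Even when the BMC only reports a chassis temperature, the in-band hwmon
interface may still expose per-socket CPU temperatures if a sensor
driver (e.g. k10temp for these Opterons) is loaded; a minimal Python
sketch that just walks sysfs, printing nothing if no sensor is
available:

    #!/usr/bin/env python
    # Walk /sys/class/hwmon and print every temperature sensor found.
    # Assumes a hwmon driver such as k10temp is loaded; on older
    # kernels the attributes live under hwmonX/device/, hence the
    # second glob pattern.
    import glob, os

    inputs = (glob.glob('/sys/class/hwmon/hwmon*/temp*_input') +
              glob.glob('/sys/class/hwmon/hwmon*/device/temp*_input'))
    for temp_input in inputs:
        hwmon_dir = os.path.dirname(temp_input)
        try:
            with open(os.path.join(hwmon_dir, 'name')) as f:
                name = f.read().strip()
        except IOError:
            name = '?'
        with open(temp_input) as f:
            millideg = int(f.read().strip())
        print('%s %s: %.1f C' % (name, os.path.basename(temp_input),
                                 millideg / 1000.0))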
>
>             What software configuration, whether a kernel parameter,
>             the configuration of numad or cpuspeed, or some other
>             setting, could affect this?
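
On the cpuspeed front, one concrete thing to compare between the
NFSroot and local-disk boots is the cpufreq governor and frequency
limits on every core; a minimal Python sketch:

    #!/usr/bin/env python
    # Print the cpufreq governor and min/max/current frequency for each
    # core. Differences between the NFSroot and local-disk boots here
    # would implicate cpuspeed or the governor rather than the hardware.
    import glob, os

    def read(path):
        try:
            with open(path) as f:
                return f.read().strip()
        except IOError:
            return 'n/a'

    for cpu in sorted(glob.glob('/sys/devices/system/cpu/cpu[0-9]*/cpufreq')):
        print('%s: governor=%s min=%s max=%s cur=%s' % (
            cpu.split('/')[-2],
            read(os.path.join(cpu, 'scaling_governor')),
            read(os.path.join(cpu, 'scaling_min_freq')),
            read(os.path.join(cpu, 'scaling_max_freq')),
            read(os.path.join(cpu, 'scaling_cur_freq'))))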
>
>             Prentice
>
>             On 09/08/2017 02:41 PM, Prentice Bisbal wrote:
>
>                 Beowulfers,
>
>                 I need your assistance debugging a problem:
>
>                 I have a dozen servers with identical hardware:
>                 SuperMicro servers with AMD Opteron 6320 processors.
>                 Ever since we upgraded to CentOS 6, the users have
>                 been complaining of wildly inconsistent performance
>                 across these 12 nodes. I ran LINPACK on these nodes
>                 and was able to duplicate the problem, with
>                 performance varying from ~14 GFLOPS to 64 GFLOPS.
>
>                 I've found that performance on the slower nodes
>                 starts off fine and then slowly degrades throughout
>                 the LINPACK run. For example, on a node with this
>                 problem, during the first LINPACK test I can see the
>                 performance drop from 115 GFLOPS down to 11.3 GFLOPS.
>                 That constant downward trend continues throughout the
>                 remaining tests. At the start of each subsequent test,
>                 performance will jump up to about 9-10 GFLOPS, but
>                 then drop to 5-6 GFLOPS by the end of the test.
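
Whether that downward curve tracks the core clocks is easy to check by
sampling /proc/cpuinfo alongside the run; a minimal Python sketch:

    #!/usr/bin/env python
    # Log the minimum and maximum core clock once per second while xhpl
    # runs. If the clocks sag as the GFLOPS do, the cores are being
    # throttled.
    import time

    def core_mhz():
        mhz = []
        with open('/proc/cpuinfo') as f:
            for line in f:
                if line.startswith('cpu MHz'):
                    mhz.append(float(line.split(':')[1]))
        return mhz

    while True:
        mhz = core_mhz()
        print('%s  min=%.0f MHz  max=%.0f MHz' %
              (time.strftime('%H:%M:%S'), min(mhz), max(mhz)))
        time.sleep(1)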
>
>                 Because of the nature of this problem, I suspect this
>                 might be a thermal issue. My guess is that the
>                 processor speed is being throttled to prevent
>                 overheating on the "bad" nodes.
>
>                 But here's the thing: this wasn't a problem until we
>                 upgraded to CentOS 6. Where I work, we use a read-only
>                 NFSroot filesystem for our cluster nodes, so all nodes
>                 mount and use the exact same read-only image of the
>                 operating system. This only happens with these
>                 SuperMicro nodes, and only with CentOS 6 on NFSroot.
>                 RHEL 5 on NFSroot worked fine, and when I installed
>                 CentOS 6 on a local disk, the nodes also worked fine.
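
Since only the NFSroot boot misbehaves, a cheap first diff is the
kernel command line plus the full sysctl state of the two boots; a
minimal Python sketch, assuming the local-disk boot's state was saved
to a file first (the path is illustrative):

    #!/usr/bin/env python
    # Diff the running kernel command line and sysctl settings against
    # a copy saved from the good (local-disk) boot. The saved-file path
    # is illustrative.
    import difflib, subprocess

    def running_state():
        with open('/proc/cmdline') as f:
            cmdline = f.read()
        sysctl = subprocess.Popen(['sysctl', '-a'],
                                  stdout=subprocess.PIPE).communicate()[0]
        return (cmdline + sysctl.decode()).splitlines(True)

    with open('/root/localdisk-boot-state.txt') as f:  # saved earlier
        good = f.readlines()

    for line in difflib.unified_diff(good, running_state(),
                                     fromfile='local-disk',
                                     tofile='nfsroot'):
        print(line.rstrip())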
>
>                 Any ideas where to look or what to tweak to fix this?
>                 Any idea why this is only occurring with CentOS 6 on
>                 an NFSroot OS?
>
>         -- 
>         - Andrew "lathama" Latham lathama at gmail.com 
>         http://lathama.com -
>


