[Beowulf] Varying performance across identical cluster nodes.

Joe Landman joe.landman at gmail.com
Thu Sep 14 06:29:17 PDT 2017



On 09/14/2017 09:25 AM, John Hearns via Beowulf wrote:
> Prentice, as I understand it, the problem here is that with the same 
> OS and IB drivers there is a big difference in performance between 
> stateful and NFS root nodes.
> Throwing my hat into the ring: try looking to see if there is an 
> excessive rate of interrupts coming from the network card in the 
> nfsroot case:
>
> watch cat /proc/interrupts
>
> You will probably need a large terminal window for this, unless you 
> filter the output.
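
One way to filter it, assuming you know the relevant driver or interface
names (mlx4, ib0, and eth0 below are only examples; take the real names
from a first look at /proc/interrupts):

   # refresh once a second, showing only the NIC/HCA interrupt lines
   watch -n1 'grep -E "mlx4|ib0|eth0" /proc/interrupts'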

dstat is helpful here.
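
For instance, something like this lines up per-second CPU, network, and
interrupt/context-switch rates (the -I form takes specific IRQ numbers
read out of /proc/interrupts; 24 and 25 below are placeholders):

   dstat -tcny 1          # time, cpu, net, system (int/csw per second)
   dstat -t -I 24,25 1    # counts for just those IRQ lines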

>
> On 14 September 2017 at 15:14, Prentice Bisbal <pbisbal at pppl.gov> wrote:
>
>     Good question. I just checked using vmstat. When running xhpl on
>     both systems, vmstat shows only zeros for si and so, even long
>     after the performance degrades on the nfsroot instance. Just to be
>     sure, I double-checked with top, which shows 0k of swap being used.
>
>     Prentice
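
For anyone reproducing this check: si and so in vmstat are swap-in and
swap-out rates in kB/s, so the test is simply to leave something like
this running alongside the benchmark:

   vmstat 5    # nonzero si/so columns would mean the node is paging
   free -m     # the Swap: row shows total/used/free swap in MB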
>
>     On 09/13/2017 02:15 PM, Scott Atchley wrote:
>>     Are you swapping?
>>
>>     On Wed, Sep 13, 2017 at 2:14 PM, Andrew Latham <lathama at gmail.com> wrote:
>>
>>         Ack. So maybe validate that you can reproduce this with
>>         another NFS root, e.g. a lab setup where a single server
>>         serves the NFS root to one node. If you can reproduce it
>>         that way, it would give some direction. Beyond that, it
>>         sounds like an interesting problem.
>>
>>         On Wed, Sep 13, 2017 at 12:48 PM, Prentice Bisbal <pbisbal at pppl.gov> wrote:
>>
>>             Okay, based on the various responses I've gotten here and
>>             on other lists, I feel I need to clarify things:
>>
>>             This problem only occurs when I'm running our NFSroot
>>             based version of the OS (CentOS 6). When I run the same
>>             OS installed on a local disk, I do not have this problem,
>>             using the same exact server(s).  For testing purposes,
>>             I'm using LINPACK, and running the same executable  with
>>             the same HPL.dat file in both instances.
>>
>>             Because I'm testing the same hardware under different OS
>>             installations, this (should) rule out the BIOS and faulty
>>             hardware. That leads me to believe it's most likely a
>>             software configuration issue, like a kernel tuning
>>             parameter or some other setting.
>>
>>             These are Supermicro servers, and it seems they do not
>>             provide CPU temps. I do see a chassis temp, but not the
>>             temps of the individual CPUs. While I agree that should
>>             be the first thing to look at, it's not an option for me,
>>             and neither are tools like FLIR cameras or infrared
>>             thermometers.
>>
>>             What software configuration could affect this: a kernel
>>             parameter, the configuration of numad or cpuspeed, or
>>             some other setting?
>>
>>             Prentice
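
Since the two boots share hardware, diffing the software environments
directly is probably the fastest way to narrow this down. A rough
sketch (the node names nfsroot-node and disk-node are placeholders, and
this assumes passwordless ssh from the head node):

   # compare kernel tunables between the NFSroot and local-disk boots
   diff <(ssh nfsroot-node sysctl -a | sort) <(ssh disk-node sysctl -a | sort)

   # PXE/NFSroot and disk boots often get different kernel command lines
   ssh nfsroot-node cat /proc/cmdline
   ssh disk-node cat /proc/cmdline

   # check the usual frequency-scaling suspects on CentOS 6 (if installed)
   service cpuspeed status
   service numad status
   cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor   # if a cpufreq driver is loaded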
>>
>>             On 09/08/2017 02:41 PM, Prentice Bisbal wrote:
>>
>>                 Beowulfers,
>>
>>                 I need your assistance debugging a problem:
>>
>>                 I have a dozen servers that are all identical
>>                 hardware: SuperMicro servers with AMD Opteron 6320
>>                 processors. Ever since we upgraded to CentOS 6, the
>>                 users have been complaining of wildly inconsistent
>>                 performance across these 12 nodes. I ran LINPACK on
>>                 these nodes, and was able to duplicate the problem,
>>                 with performance varying from ~14 GFLOPS to 64 GFLOPS.
>>
>>                 I've identified that performance on the slower nodes
>>                 starts off fine, and then slowly degrades throughout
>>                 the LINPACK run. For example, on a node with this
>>                 problem, during the first LINPACK test, I can see the
>>                 performance drop from 115 GFLOPS down to 11.3 GFLOPS.
>>                 That constant, downward trend continues throughout
>>                 the remaining tests. At the start of subsequent
>>                 tests, performance will jump up to about 9-10 GFLOPS,
>>                 but then drop to 5-6 GFLOPS at the end of the test.
>>
>>                 Because of the nature of this problem, I suspect this
>>                 might be a thermal issue. My guess is that the
>>                 processor speed is being throttled to prevent
>>                 overheating on the "bad" nodes.
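
One cheap way to test the throttling theory without temperature sensors:
the effective clock is visible in /proc/cpuinfo, so watch it while
LINPACK runs. If the reported MHz stays at nominal while the GFLOPS
number falls, throttling is probably not the cause:

   # count how many cores sit at each frequency, refreshed each second
   watch -n1 'grep "cpu MHz" /proc/cpuinfo | sort | uniq -c'

   # or the cpufreq view, if a scaling driver is loaded
   cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq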
>>
>>                 But here's the thing: this wasn't a problem until we
>>                 upgraded to CentOS 6. Where I work, we use a
>>                 read-only NFSroot filesystem for our cluster nodes,
>>                 so all nodes are mounting and using the same exact
>>                 read-only image of the operating system. This only
>>                 happens with these SuperMicro nodes, and only with
>>                 CentOS 6 on NFSroot. RHEL5 on NFSroot worked
>>                 fine, and when I installed CentOS 6 on a local disk,
>>                 the nodes worked fine.
>>
>>                 Any ideas where to look or what to tweak to fix this?
>>                 Any idea why this is only occurring with CentOS 6 on
>>                 an NFS root?
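
If the NFS root itself is implicated, the client-side counters are worth
watching during a slow run (both of these exist on CentOS 6):

   nfsstat -c                     # per-op NFS client call counts
   cat /proc/self/mountstats      # detailed per-mount RPC timings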
>>
>>         -- 
>>         - Andrew "lathama" Latham lathama at gmail.com http://lathama.com -

-- 
Joe Landman
e: joe.landman at gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman


