[Beowulf] How to debug slow compute node?

Thu Aug 10 11:04:15 PDT 2017

I put €10 on the nose for a faulty power supply.

On 10 August 2017 at 19:45, Gus Correa <gus at ldeo.columbia.edu> wrote:

> + Leftover processes from previous jobs hogging resources.
> That's relatively common.
> That can trigger swapping, the ultimate performance killer.
> "top" or "htop" on the node should show something.
> (Will go away with a reboot, of course.)
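A rough way to script that check, instead of eyeballing "top": the Python sketch below lists processes owned by ordinary users and reports how much swap is in use, reading only /proc. The UID cutoff of 1000 is an assumption (some distributions start user accounts at 500), and the function names are made up for illustration.

```python
import os


def parse_status(text):
    """Extract the command name and real UID from /proc/<pid>/status text."""
    info = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key == "Name":
            info["name"] = value.strip()
        elif key == "Uid":
            info["uid"] = int(value.split()[0])  # first field is the real UID
    return info


def parse_meminfo(text):
    """Parse /proc/meminfo text into a {field: value} dict (values mostly in kB)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def swap_used_kb(meminfo):
    """Swap in use, in kB, from a parsed meminfo dict."""
    return meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)


def user_processes(min_uid=1000, proc="/proc"):
    """Return (pid, uid, name) for processes whose UID is >= min_uid.

    min_uid=1000 is an assumed cutoff between system and user accounts.
    """
    found = []
    try:
        entries = os.listdir(proc)
    except OSError:
        return found
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "status")) as fh:
                info = parse_status(fh.read())
        except OSError:
            continue  # the process exited while we were scanning
        if info.get("uid", 0) >= min_uid:
            found.append((int(entry), info["uid"], info.get("name", "?")))
    return found


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as fh:
            print("swap used (kB):", swap_used_kb(parse_meminfo(fh.read())))
    except OSError:
        print("could not read /proc/meminfo")
    for pid, uid, name in user_processes():
        print(pid, uid, name)
```

Run it on the slow node while no job is scheduled there; any user-owned process it lists, or a large swap figure, is worth a closer look.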
>
> Less likely, but possible:
>
> + Different BIOS configuration w.r.t. the other nodes.
>
> + Poorly seated memory, IB card, etc., or cable connections.
>
> + IPMI may need a hard reset.
> Power down, remove the power cable, wait several minutes,
> put the cable back, power on.
>
> Gus Correa
>
> On 08/10/2017 11:17 AM, John Hearns via Beowulf wrote:
>
>> Another thing to perhaps look at. Are you seeing messages about thermal
>> throttling events in the system logs?
>> Could that node have a piece of debris caught in its air intake?
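Besides the logs, Linux on Intel CPUs normally keeps per-CPU throttle counters in sysfs. The sketch below reads them in Python. It assumes the usual /sys/devices/system/cpu/cpuN/thermal_throttle/ layout is present on the node; if the directory is missing, it simply reports nothing. The function names are illustrative.

```python
import glob
import os


def read_throttle_counts(root="/sys/devices/system/cpu"):
    """Return {(cpu, counter_name): count} for every *_throttle_count file found."""
    counts = {}
    pattern = os.path.join(root, "cpu[0-9]*", "thermal_throttle", "*_throttle_count")
    for path in sorted(glob.glob(pattern)):
        cpu = path.split(os.sep)[-3]      # e.g. "cpu19"
        kind = os.path.basename(path)     # e.g. "core_throttle_count"
        try:
            with open(path) as fh:
                counts[(cpu, kind)] = int(fh.read().strip())
        except (OSError, ValueError):
            continue  # unreadable or unexpected content: skip this counter
    return counts


def throttled(counts):
    """Keep only the counters that are non-zero."""
    return {key: n for key, n in counts.items() if n > 0}


if __name__ == "__main__":
    hits = throttled(read_throttle_counts())
    if not hits:
        print("no thermal throttle events recorded (or counters not available)")
    for (cpu, kind), n in sorted(hits.items()):
        print(cpu, kind, n)
```

The counters are cumulative since boot, so comparing the slow node against a healthy one after the same benchmark run is more telling than a single reading.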
>>
>> I don't think that will produce a 30% drop in performance. But I have
>> caught compute nodes with pieces of packaging sucked onto the front,
>> following careless people unpacking kit in machine rooms.
>> (Firm rule - no packaging in the machine room. This means you.)
>>
>>
>>
>>
>> On 10 August 2017 at 17:00, John Hearns <hearnsj at googlemail.com> wrote:
>>
>>     PS. Also look at: watch cat /proc/interrupts
>>     You might get a qualitative idea of whether there is a huge rate
>>     of interrupts.
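To put numbers on that rather than watching the counters scroll, the Python sketch below samples /proc/interrupts twice and prints the busiest sources per second. The 2-second interval and the function names are arbitrary choices for illustration. On x86 kernels the machine-check and thermal lines are usually labelled MCE, MCP, THR and TRM, but treat those labels as an assumption and check what your kernel prints.

```python
import time


def parse_interrupts(text):
    """Parse /proc/interrupts text into {label: (total_over_all_cpus, description)}."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        tokens = rest.split()
        counts = []
        for tok in tokens[:ncpus]:
            if not tok.isdigit():
                break
            counts.append(int(tok))
        desc = " ".join(tokens[len(counts):])
        totals[label.strip()] = (sum(counts), desc)
    return totals


def interrupt_rates(before, after, seconds):
    """Interrupts per second for each label between two parsed samples."""
    rates = {}
    for label, (count, desc) in after.items():
        prev = before.get(label, (0, ""))[0]
        rates[label] = ((count - prev) / seconds, desc)
    return rates


def sample(path="/proc/interrupts"):
    with open(path) as fh:
        return parse_interrupts(fh.read())


if __name__ == "__main__":
    interval = 2.0  # seconds between samples; arbitrary
    try:
        first = sample()
        time.sleep(interval)
        second = sample()
    except OSError:
        print("could not read /proc/interrupts")
    else:
        rates = interrupt_rates(first, second, interval)
        top = sorted(rates.items(), key=lambda kv: kv[1][0], reverse=True)[:10]
        for label, (rate, desc) in top:
            print(f"{label:>5} {rate:12.1f}/s  {desc}")
```

Running the same script on a healthy node at the same time gives a baseline to compare against.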
>>
>>
>>     On 10 August 2017 at 16:59, John Hearns <hearnsj at googlemail.com> wrote:
>>
>>         Faraz,
>>             I think you might have to buy me a virtual coffee. Or a beer!
>>         Please look at the hardware health of that machine. Specifically
>>         the DIMMs.  I have seen this before!
>>         If you have some DIMMs which are faulty and are generating ECC
>>         errors, then if the mcelog service is enabled
>>         an interrupt is generated for every ECC event. So the system is
>>         spending time servicing these interrupts.
>>
>>         So:   look in your /var/log/mcelog for hardware errors
>>         Look in your /var/log/messages for hardware errors also
>>         Look in the IPMI event logs for ECC errors:    ipmitool sel elist
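The IPMI event log can get long, so a small filter helps. The Python sketch below runs "ipmitool sel elist" and counts ECC events per sensor. It assumes the usual pipe-separated output (record id | date | time | sensor | event | direction); BMC vendors vary, so treat the field positions as an assumption and adjust if your output differs. The function names are illustrative.

```python
import subprocess
from collections import Counter


def count_ecc_events(sel_text):
    """Count SEL lines mentioning ECC, keyed by the sensor field (4th column)."""
    counts = Counter()
    for line in sel_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 4:
            continue  # not a pipe-separated SEL record
        if any("ecc" in f.lower() for f in fields[3:]):
            counts[fields[3]] += 1
    return counts


def read_sel():
    """Return the output of 'ipmitool sel elist', or '' if it cannot be run."""
    try:
        result = subprocess.run(
            ["ipmitool", "sel", "elist"],
            capture_output=True, text=True, check=False,
        )
    except OSError:
        return ""  # ipmitool not installed or not executable
    return result.stdout


if __name__ == "__main__":
    events = count_ecc_events(read_sel())
    if not events:
        print("no ECC events found in the SEL (or ipmitool unavailable)")
    for sensor, n in events.most_common():
        print(f"{n:6d}  {sensor}")
```

A sensor that shows up far more often than the rest is a reasonable first candidate for the DIMM to reseat or swap.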
>>
>>         I would also bring that node down and boot it with memtester.
>>         If there is a DIMM which is that badly faulty then memtester
>>         will discover it within minutes.
>>
>>         Or it could be something else - in which case I get no coffee.
>>
>>         Also, Intel Cluster Checker is intended to deal with exactly these
>>         situations.
>>         What is your cluster manager, and is Intel Cluster Checker
>>         available to you?
>>         I would seriously look at getting this installed.
>>
>>
>>
>>
>>
>>
>>
>>         On 10 August 2017 at 16:39, Faraz Hussain <info at feacluster.com> wrote:
>>
>>             One of our compute nodes runs ~30% slower than the others. It
>>             has the exact same image, so I am baffled as to why it is
>>             running slow. I have tested OMP and MPI benchmarks. Everything
>>             runs slower. The CPU usage goes to 2000%, so all looks normal
>>             there.
>>
>>             I thought it may have to do with CPU scaling, i.e. when the
>>             kernel changes the CPU speed depending on the workload. But
>>             we do not have that enabled on these machines.
>>
>>             Here is a snippet from "cat /proc/cpuinfo". Everything is
>>             identical to our other nodes. Any suggestions on what else
>>             to check? I have tried rebooting it.
>>
>>             processor       : 19
>>             vendor_id       : GenuineIntel
>>             cpu family      : 6
>>             model           : 62
>>             model name      : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
>>             stepping        : 4
>>             cpu MHz         : 2500.098
>>             cache size      : 25600 KB
>>             physical id     : 1
>>             siblings        : 10
>>             core id         : 12
>>             cpu cores       : 10
>>             apicid          : 56
>>             initial apicid  : 56
>>             fpu             : yes
>>             fpu_exception   : yes
>>             cpuid level     : 13
>>             wp              : yes
>>             flags           : fpu vme de pse tsc msr pae mce cx8 apic
>>             sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr
>>             sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm
>>             constant_tsc arch_perfmon pebs bts rep_good xtopology
>>             nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl
>>             vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2
>>             x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand
>>             lahf_lm ida arat xsaveopt pln pts dts tpr_shadow vnmi
>>             flexpriority ept vpid fsgsbase smep erms
>>             bogomips        : 5004.97
>>             clflush size    : 64
>>             cache_alignment : 64
>>             address sizes   : 46 bits physical, 48 bits virtual
>>             power management:
>>
>>
>>
>>             _______________________________________________
>>             Beowulf mailing list, Beowulf at beowulf.org
>>             sponsored by Penguin Computing
>>             To change your subscription (digest mode or unsubscribe)
>>             visit http://www.beowulf.org/mailman/listinfo/beowulf
>>