[Beowulf] SIMD exception kernel panic on Skylake-EP triggered by OpenFOAM?

Chris Samuel chris at csamuel.org
Wed Sep 12 00:40:13 PDT 2018


On Monday, 10 September 2018 2:23:18 PM AEST Jonathan Engwall wrote:

> If it is helpful there are a few similar bugs, generally considered
> unreproducible. One thread calls it bogus xcomp_bv...the kernel clobbers
> itself writing zeroes when that is not the state. And spectre came up. One
> suggestion is to disable IBRS; according to other sources IBRS is dangerous
> to disable and should protect against Spectre. Maybe the OpenFOAM is to
> blame.

Yeah, I suspect what we're seeing is different to that. It looks like
something manages to generate a SIMD exception whilst the kernel is dealing
with an APIC timer interrupt. A colleague has backported this patch that I
found to our CentOS kernel in case it helps:

https://lore.kernel.org/patchwork/patch/953364/

For now we've constrained this user's workload to a handful of nodes, as they
are trying to get some project work done.

All the best!
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC



