[Beowulf] Odd SuperMicro power off issues

Scott Atchley atchley at myri.com
Mon Dec 8 04:10:06 PST 2008


Hi Chris,

We had a customer with Opterons experience reboots with nothing in the  
logs, etc. The only thing we saw with "ipmitool sel list" was:

    1 | 11/13/2007 | 10:49:44 | System Firmware Error |
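
To check many nodes for the same entry, something like the following could filter saved "ipmitool sel list" output. This is only a minimal sketch: it assumes the pipe-separated layout and MM/DD/YYYY date format shown in the line above, and the function names (parse_sel_line, firmware_errors) are hypothetical helpers, not part of ipmitool.

```python
from datetime import datetime


def parse_sel_line(line):
    """Split one 'ipmitool sel list' line into its pipe-separated fields.

    Assumes the layout seen above: id | date | time | sensor | ...
    Returns None for lines that do not match that layout.
    """
    fields = [f.strip() for f in line.split("|")]
    if len(fields) < 4:
        return None
    try:
        # Assumption: dates are MM/DD/YYYY, as in the sample entry.
        when = datetime.strptime(fields[1] + " " + fields[2],
                                 "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return {
        "id": fields[0],          # kept as a string; we do not interpret it
        "time": when,
        "sensor": fields[3],
        "rest": [f for f in fields[4:] if f],
    }


def firmware_errors(lines):
    """Return parsed entries whose sensor field mentions a firmware error."""
    parsed = (parse_sel_line(line) for line in lines)
    return [e for e in parsed
            if e is not None and "System Firmware Error" in e["sensor"]]


sample = "    1 | 11/13/2007 | 10:49:44 | System Firmware Error |"
entries = firmware_errors([sample])
for entry in entries:
    print(entry["id"], entry["time"].isoformat(), entry["sensor"])
```

Feeding it the captured output of each node (one line per list element) gives a quick way to see which nodes logged the event and when.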

We traced it to a HyperTransport deadlock, which by default reboots the
node. Our engineer found this AMD note:

reset through sync-flooding is described in chapter "13.15 Error  
Handling" in the following document:

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/32559.pdf

When we changed the default PCI setting for this option (0x50) to off
(i.e. no reboot, 0x40), the node did not reboot, but it did hang and
required an IPMI reboot.
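
For reference, the two values differ in a single bit. The short sketch below only does the arithmetic on the values quoted above; which register and device they belong to is not stated in this message, so the constant names are purely illustrative and the AMD document should be consulted for the actual field definition.

```python
# Values as quoted in the message; the names are illustrative only.
DEFAULT_SETTING = 0x50    # reboot on sync flood (default)
NO_REBOOT_SETTING = 0x40  # option turned off

# XOR leaves only the bits that differ between the two settings.
changed = DEFAULT_SETTING ^ NO_REBOOT_SETTING
changed_bits = [bit for bit in range(8) if changed & (1 << bit)]

print(hex(changed), changed_bits)  # 0x10 [4]
```

So turning the option off amounts to clearing bit 4 (mask 0x10) of that setting and leaving the rest unchanged.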

Our working assumption is that one particular application running
over our NICs induced some pattern of traffic that caused a
flow-control deadlock in HT.

Scott

On Dec 7, 2008, at 10:33 PM, Chris Samuel wrote:

> Hi folks,
>
> We've been tearing our hair out over this for a little
> while and so I'm wondering if anyone else has seen anything
> like this before, or has any thoughts about what could be
> happening ?
>
> Very occasionally we find one of our Barcelona nodes with
> a SuperMicro H8DM8-2 motherboard powered off.  IPMI reports
> it as powered down too.
>
> No kernel panic, no crash, nothing in the system logs.
>
> Nothing in the IPMI logs either, it's just sitting there
> as if someone has yanked the power cable (and we're pretty
> sure that's not the cause!).
>
> There has not been any discernible pattern to the nodes
> affected: only a couple of nodes have had it happen twice,
> the rest have had it happen only once, and they are scattered
> over the 3 racks of the cluster.
>
> For the longest time we had no way to reproduce it, but then
> we noticed that for 3 of the power-offs there was a particular
> user running Fluent on the node.  They've provided us with a copy
> of their problem and we can (often) reproduce it now with that
> problem.  Sometimes it'll take 30 minutes or so, sometimes it'll
> take 4-5 hours, sometimes it'll take 3 days or so and sometimes
> it won't do it at all.
>
> It doesn't appear to be thermal issues as (a) there's nothing in
> the IPMI logs about such problems and (b) we inject CPU and system
> temperature into Ganglia and we don't see anything out of the
> ordinary in those logs. :-(
>
> We've tried other codes, including HPL, and Advanced Clustering's
> Breakin PXE version, but haven't managed to (yet) get one of the
> nodes to fail with anything except Fluent. :-(
>
> The only oddity about Fluent is that it's the only code on
> the system that uses HP-MPI, but we used the command line
> switches to tell it to use the Intel MPI it ships with and
> it did the same then too!
>
> I just cannot understand what is special about Fluent,
> or even how a user code could cause a node to just turn
> off without a trace in the logs.
>
> Obviously we're pursuing this through the local vendor
> and (through them) SuperMicro, but to be honest we're
> all pretty stumped by this.
>
> Does anyone have any bright ideas ?
>
> cheers,
> Chris
> -- 
> Christopher Samuel - (03) 9925 4751 - Systems Manager
> The Victorian Partnership for Advanced Computing
> P.O. Box 201, Carlton South, VIC 3053, Australia
> VPAC is a not-for-profit Registered Research Agency
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf



