[Beowulf] Odd SuperMicro power off issues
atchley at myri.com
Mon Dec 8 04:10:06 PST 2008
We had a customer whose Opteron nodes rebooted with nothing in the
logs. The only thing we saw with "ipmitool sel list" was:
1 | 11/13/2007 | 10:49:44 | System Firmware Error |
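For anyone who wants to scan many nodes for the same entry, here is a minimal sketch that picks "System Firmware Error" lines out of "ipmitool sel list" output. It assumes the pipe-delimited layout shown in the line above. The function names and dictionary keys are made up for illustration, and other BMCs may print additional fields.

```python
import subprocess

# The SEL line quoted in this message, used as sample input.
SAMPLE = "1 | 11/13/2007 | 10:49:44 | System Firmware Error |"


def parse_sel_line(line):
    """Split one pipe-delimited SEL line into a dict, or return None."""
    fields = [f.strip() for f in line.split("|")]
    if len(fields) < 4 or not fields[0]:
        return None  # blank or unexpected line
    return {
        "id": fields[0],
        "date": fields[1],
        "time": fields[2],
        "event": fields[3],
        # Anything after the event name (often empty, as in the sample).
        "extra": [f for f in fields[4:] if f],
    }


def firmware_errors(text):
    """Return the parsed entries whose event is a System Firmware Error."""
    entries = (parse_sel_line(line) for line in text.splitlines())
    return [e for e in entries if e and "System Firmware Error" in e["event"]]


def read_sel():
    """Run 'ipmitool sel list' locally and return its output as text."""
    result = subprocess.run(
        ["ipmitool", "sel", "list"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Uses the sample line; swap in read_sel() on a node with a BMC.
    for entry in firmware_errors(SAMPLE):
        print(entry["date"], entry["time"], entry["event"])
```

An empty result only means the SEL holds no such entry. It does not rule out the deadlock.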
We traced it to a HyperTransport deadlock, which by default reboots the
node. Our engineer found this AMD note:
reset through sync-flooding is described in chapter "13.15 Error
Handling" in the following document:
When we changed the default PCI setting for this option (0x50) to off
(i.e. no reboot, 0x40), the node did not reboot but it did hang and
required an IPMI reboot.
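As a small worked example of that change, and assuming 0x50 and 0x40 are the before and after values of a single configuration byte (the register itself is not named here), the two values differ in exactly one bit:

```python
# Assumption: these are the old and new values of one configuration byte.
DEFAULT = 0x50    # default setting: node reboots
NO_REBOOT = 0x40  # changed setting: no reboot

# XOR leaves a 1 wherever the two values differ.
changed = DEFAULT ^ NO_REBOOT

# List the positions of the differing bits (bit 0 = least significant).
bits = [i for i in range(8) if (changed >> i) & 1]

print(hex(changed), bits)
```

So, under that assumption, turning the option off amounts to clearing bit 4 (mask 0x10) and leaving the rest of the byte unchanged.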
Our working assumption is that one particular application running over
our NICs generated a traffic pattern that caused a flow-control
deadlock in HT.
On Dec 7, 2008, at 10:33 PM, Chris Samuel wrote:
> Hi folks,
> We've been tearing our hair out over this for a little
> while and so I'm wondering if anyone else has seen anything
> like this before, or has any thoughts about what could be
> happening ?
> Very occasionally we find one of our Barcelona nodes with
> a SuperMicro H8DM8-2 motherboard powered off. IPMI reports
> it as powered down too.
> No kernel panic, no crash, nothing in the system logs.
> Nothing in the IPMI logs either, it's just sitting there
> as if someone has yanked the power cable (and we're pretty
> sure that's not the cause!).
> There had not been any discernible pattern to the nodes
> affected, and we've only a couple nodes where it's happened
> twice, the rest only have had it happen once and scattered
> over the 3 racks of the cluster.
> For the longest time we had no way to reproduce it, but then
> we noticed that for 3 of the power-offs there was a particular
> user running Fluent on there. They've provided us with a copy
> of their problem and we can (often) reproduce it now with that
> problem. Sometimes it'll take 30 minutes or so, sometimes it'll
> take 4-5 hours, sometimes it'll take 3 days or so and sometimes
> it won't do it at all.
> It doesn't appear to be thermal issues as (a) there's nothing in
> the IPMI logs about such problems and (b) we inject CPU and system
> temperature into Ganglia and we don't see anything out of the
> ordinary in those logs. :-(
> We've tried other codes, including HPL, and Advanced Clustering's
> Breakin PXE version, but haven't managed to (yet) get one of the
> nodes to fail with anything except Fluent. :-(
> The only oddity about Fluent is that it's the only code on
> the system that uses HP-MPI, but we used the command line
> switches to tell it to use the Intel MPI it ships with and
> it did the same then too!
> I just cannot understand what is special about Fluent,
> or even how a user code could cause a node to just turn
> off without a trace in the logs.
> Obviously we're pursuing this through the local vendor
> and (through them) SuperMicro, but to be honest we're
> all pretty stumped by this.
> Does anyone have any bright ideas ?
> Christopher Samuel - (03) 9925 4751 - Systems Manager
> The Victorian Partnership for Advanced Computing
> P.O. Box 201, Carlton South, VIC 3053, Australia
> VPAC is a not-for-profit Registered Research Agency