[Beowulf] Odd SuperMicro power off issues
Scott Atchley atchley at myri.com
Mon Dec 8 04:10:06 PST 2008
Hi Chris,
We had a customer whose Opteron nodes rebooted with nothing in the
logs. The only thing we saw with "ipmitool sel list" was:
1 | 11/13/2007 | 10:49:44 | System Firmware Error |
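For anyone who wants to check many nodes for the same entry, here is a
minimal sketch in Python that filters "ipmitool sel list" output for it.
It assumes the pipe-delimited layout shown above; the names SelEntry,
parse_sel, firmware_errors and read_sel are made up for illustration,
and other BMCs may format their SEL records differently.

```python
import subprocess
from typing import List, NamedTuple


class SelEntry(NamedTuple):
    """One SEL record, assuming 'id | date | time | description |' layout."""
    record_id: str
    date: str
    time: str
    description: str


def parse_sel(text: str) -> List[SelEntry]:
    """Parse pipe-delimited SEL text into entries, skipping malformed lines."""
    entries = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 4 or not fields[0]:
            continue  # blank line or not a SEL record
        # Anything after the time column is treated as the description.
        description = " | ".join(f for f in fields[3:] if f)
        entries.append(SelEntry(fields[0], fields[1], fields[2], description))
    return entries


def firmware_errors(entries: List[SelEntry]) -> List[SelEntry]:
    """Keep only the entries that mention a System Firmware Error."""
    return [e for e in entries if "System Firmware Error" in e.description]


def read_sel() -> str:
    """Run 'ipmitool sel list' locally (needs ipmitool and BMC access)."""
    result = subprocess.run(
        ["ipmitool", "sel", "list"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On a node with ipmitool installed, firmware_errors(parse_sel(read_sel()))
returns the matching records; parse_sel can also be fed saved output
collected from other nodes.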
We traced it to a HyperTransport deadlock, which by default reboots the
node. Our engineer found this AMD note:
reset through sync-flooding is described in chapter "13.15 Error
Handling" in the following document:
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/32559.pdf
When we changed the PCI setting for this option from its default (0x50)
to off (0x40, i.e. no reboot), the node no longer rebooted, but it hung
and required an IPMI reboot.
Our working assumption is that one particular application running over
our NICs generated a traffic pattern that caused a flow-control
deadlock in HT.
Scott
On Dec 7, 2008, at 10:33 PM, Chris Samuel wrote:
> Hi folks,
>
> We've been tearing our hair out over this for a little
> while and so I'm wondering if anyone else has seen anything
> like this before, or has any thoughts about what could be
> happening ?
>
> Very occasionally we find one of our Barcelona nodes with
> a SuperMicro H8DM8-2 motherboard powered off. IPMI reports
> it as powered down too.
>
> No kernel panic, no crash, nothing in the system logs.
>
> Nothing in the IPMI logs either, it's just sitting there
> as if someone has yanked the power cable (and we're pretty
> sure that's not the cause!).
>
> There had not been any discernible pattern to the nodes
> affected, and we've only a couple nodes where it's happened
> twice, the rest only have had it happen once and scattered
> over the 3 racks of the cluster.
>
> For the longest time we had no way to reproduce it, but then
> we noticed that for 3 of the power off's there was a particular
> user running Fluent on there. They've provided us with a copy
> of their problem and we can (often) reproduce it now with that
> problem. Sometimes it'll take 30 minutes or so, sometimes it'll
> take 4-5 hours, sometimes it'll take 3 days or so and sometimes
> it won't do it at all.
>
> It doesn't appear to be thermal issues as (a) there's nothing in
> the IPMI logs about such problems and (b) we inject CPU and system
> temperature into Ganglia and we don't see anything out of the
> ordinary in those logs. :-(
>
> We've tried other codes, including HPL, and Advanced Clustering's
> Breakin PXE version, but haven't managed to (yet) get one of the
> nodes to fail with anything except Fluent. :-(
>
> The only oddity about Fluent is that it's the only code on
> the system that uses HP-MPI, but we used the command line
> switches to tell it to use the Intel MPI it ships with and
> it did the same then too!
>
> I just cannot understand what is special about Fluent,
> or even how a user code could cause a node to just turn
> off without a trace in the logs.
>
> Obviously we're pursuing this through the local vendor
> and (through them) SuperMicro, but to be honest we're
> all pretty stumped by this.
>
> Does anyone have any bright ideas ?
>
> cheers,
> Chris
> --
> Christopher Samuel - (03) 9925 4751 - Systems Manager
> The Victorian Partnership for Advanced Computing
> P.O. Box 201, Carlton South, VIC 3053, Australia
> VPAC is a not-for-profit Registered Research Agency
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
