[Beowulf] Odd SuperMicro power off issues

Chris Samuel csamuel at vpac.org
Tue Jan 13 18:57:41 PST 2009

----- "Chris Samuel" <csamuel at vpac.org> wrote:

> Very occasionally we find one of our Barcelona nodes with
> a SuperMicro H8DM8-2 motherboard powered off.  IPMI reports
> it as powered down too.
> No kernel panic, no crash, nothing in the system logs.

I thought that people might be interested in this update.
We'd tried swapping virtually everything short of the
label on the front of the case (including trying much higher
capacity PSUs) to no avail.

We had one node, which we used as our test platform, that I
could reliably power off in about 30 seconds to 1 minute by
running a certain Gaussian job (other nodes were far more
random, and we've seen this issue on about half of the 95
nodes so far).

We decided to try 2.6.29-rc1 in case some of the extra
debug info (e.g. commit 8652cb4b0d87accbe78725fd2a13be2787059649)
helped, and were amazed to find that I could no longer kill
it: the Gaussian job ran to completion in about 2 days.

We rebooted back to 2.6.28 (not without issues [1]) and I
killed it again in about 30 seconds.

Rebooted back into 2.6.29-rc1 and it ran happily again.

So whilst I am not saying that the problem is solved
(we would need to see a large proportion of the cluster
running jobs without poweroffs first), I can at least say
that 2.6.29-rc1 does seem to be mitigating the problem on
this specific node.

We are now doing a sort of reverse bisection to try to
figure out what fixed it, which is going to take a little
time! ;-)

We've got to be careful, as the git bisect tool doesn't let
you have the "good" revision after the "bad" one. It assumes
you're trying to find something that broke rather than
something that fixed a problem, so we have to remember to
say "git bisect bad" when the kernel is good (the node stays
up) and "git bisect good" when it's bad (the node powers
off). ;-)
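
For anyone wanting to see the inverted labelling spelled out, here is a
minimal sketch in Python. It is only an illustration, not what we are
actually running: the names bisect_label, first_fixed and powers_off are
made up for this example, the history is treated as a simple linear list
(real kernel history is a graph, which git bisect handles itself), and
powers_off stands in for booting a kernel and running the Gaussian job.

```python
def bisect_label(node_powered_off: bool) -> str:
    """Return the label to give `git bisect` when hunting for a fix.

    The labels are inverted: a kernel that still powers the node off
    shows the old behaviour, so it is reported as "good"; a kernel that
    survives the job already contains the fix, so it is reported as "bad".
    """
    return "good" if node_powered_off else "bad"


def first_fixed(commits, powers_off):
    """Binary search for the first commit at which the node stays up.

    commits    -- list ordered oldest to newest; commits[0] is assumed to
                  power off and commits[-1] is assumed to survive.
    powers_off -- hypothetical callable(commit) -> bool standing in for
                  "boot this kernel, run the job, did the node die?".
    """
    lo, hi = 0, len(commits) - 1  # lo: known broken, hi: known fixed
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if bisect_label(powers_off(commits[mid])) == "good":
            lo = mid  # still powers off, so the fix is later
        else:
            hi = mid  # already fixed, so the fix is here or earlier
    return commits[hi]


if __name__ == "__main__":
    # Toy history of 10 commits in which commit 6 introduces the fix.
    history = list(range(10))
    print(first_fixed(history, lambda c: c < 6))
```

The only point of bisect_label is that the answer given to git is
the opposite of what you would instinctively say about the kernel under
test.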


[1] - The one fly in the ointment is that when we reboot
back into 2.6.28, eth1 can no longer negotiate with our
gigabit switch. We think this is due to some nforce driver
changes, possibly commit cb52deba12f27af90a46d2f8667a64888118a888.

Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency

