[Beowulf] Tips for diagnosing intermittent problems on a small cluster
smulcahy at aplpi.com
Mon Nov 26 00:47:19 PST 2007
Andrew M.A. Cater wrote:
> Same here on a single machine with an earlier model Tyan board - it
> happened to us either after a very occasional kernel panic/exception or
> after 25-28 days of continuous running. I've got a 2885 here, if I can
> just find two Opterons, memory and a case :-) I'll let you know if this
> one does it too.
> There _may_ be some PSU involvement with ours: the machine and fans are
> running but not accepting connections. You have to disconnect the power
> for a few minutes for it to even boot again properly. Powercycling from
> the front panel doesn't always work.
> Debian etch, stock Debian kernel (2.6.18-5 from memory).
We're running pretty much the same kernels.
To be fair to our system, it seems to be rock solid in general and
certainly has no problems switching on or off normally. Nor have we seen
any kernel panics (as far as I can remember).
Nonetheless, I'm hearing some anecdotes relating to some Tyan
motherboards and this kind of behaviour. Until I figure out the root
cause and a more targeted fix, I guess we'll be rebooting more often!
Stephen Mulcahy, Applepie Solutions Ltd., Innovation in Business Center,
GMIT, Dublin Rd, Galway, Ireland. +353.91.751262 http://www.aplpi.com
Registered in Ireland, no. 289353 (5 Woodlands Avenue, Renmore, Galway)