[Beowulf] Tips for diagnosing intermittent problems on a small
smulcahy at aplpi.com
Wed Nov 21 09:27:57 PST 2007
As I mentioned in my previous posting, the 20-node Tyan S2891 dual
Opteron dual-core Debian cluster (1 head node providing NFS, 19 diskless
compute nodes) is currently experiencing 2 intermittent problems which
I'm trying to diagnose.
After a few days of testing and digging through system logs I'm pretty
much stumped as to what may be causing these. There are 2 separate
problems - anyone's opinions on how to go about diagnosing these problems
or things I might have missed would be most welcome.
The first problem: over the last 6 months, 3 different nodes have been
found in a powered-down state - the nodes seem to have powered off
during a run of the model. There are no interesting messages in the
system logs coinciding with the time of these shutdowns. My first
suspect was the power supply to the cluster, but the UPS power system
has logged no errors coinciding with these failures. I've run a bunch of
stress testers on the systems that failed, including cpuburn and
cpustress, in the hope that a failing component such as a PSU or
processor would be triggered again -- but all the systems happily ran
24 hours of tests without any problems.
2 of the 3 failing systems are logging some MCE messages - but they seem
to be standard memory errors which are being corrected by the system.
Any suggestions on where to go next?
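One way to check whether the corrected memory errors cluster on the
nodes that powered off is to tally MCE-related kernel lines per host
from a central syslog. A minimal sketch in Python, assuming the
traditional syslog line layout ("Mon DD HH:MM:SS host kernel: ...") and
assuming the kernel lines of interest contain the text "Machine check";
both are assumptions that should be checked against the real logs, and
the function name is hypothetical:

```python
import re
from collections import Counter

# Assumed traditional syslog layout: "Nov 21 09:27:57 somehost kernel: message"
SYSLOG_LINE = re.compile(r"^\w{3}\s+\d+\s+[\d:]{8}\s+(\S+)\s+kernel:\s+(.*)$")


def count_mce_lines(lines, marker="Machine check"):
    """Count kernel log lines containing `marker`, grouped by hostname.

    `marker` is an assumed substring; adjust it to whatever the MCE
    messages on the affected nodes actually look like.
    """
    counts = Counter()
    for line in lines:
        match = SYSLOG_LINE.match(line)
        if match and marker.lower() in match.group(2).lower():
            counts[match.group(1)] += 1
    return counts
```

Feeding it the lines of the head node's kernel log (for example
`count_mce_lines(open(path))`, with `path` pointing at wherever the
compute nodes log to) gives a per-node count that can be compared
against the list of nodes that powered off.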
The second problem: on 2 occasions over the last 6 months, one of the 2
oceanographic models we run on this cluster (ROMS, the other being SWAN)
has gone into a state where it runs significantly slower than usual.
This seems to have been preceded by us running the other model, but we
can't reproducibly get the system into this state. Looking at various
process stats, when the model is in the slowed-down state it goes from
about 30% system CPU time and 60% user CPU time to about 60% system CPU
time and 30% user CPU time. Again, there is nothing unusual in the
system logs, nor in the gigabit switch logs. A quick strace of one of
the running model processes didn't show anything significantly unusual
(although I don't normally sit there watching straces of the model
during normal operation, so I could well have missed all sorts of things
here).
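To catch the shift from mostly user time to mostly system time as it
happens, the user/system split could be logged periodically on each
compute node. A minimal sketch in Python, assuming a Linux /proc/stat
whose first line is the aggregate "cpu" line with counters in the order
user, nice, system, idle, ...; the function names are hypothetical and
the sampling interval is arbitrary:

```python
import time


def parse_cpu_line(line):
    """Return (user, system, total) tick counts from the aggregate 'cpu' line."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    values = [int(v) for v in fields[1:]]
    user = values[0] + values[1]   # user + nice
    system = values[2]             # system (kernel) time
    # Sum at most the first 8 counters (user .. steal); older kernels
    # report fewer, and any guest counters are already included in user.
    total = sum(values[:8])
    return user, system, total


def cpu_shares(before, after):
    """Return (user %, system %) of all CPU ticks elapsed between two samples."""
    d_user = after[0] - before[0]
    d_system = after[1] - before[1]
    d_total = after[2] - before[2]
    if d_total <= 0:
        return 0.0, 0.0
    return 100.0 * d_user / d_total, 100.0 * d_system / d_total


def read_sample(path="/proc/stat"):
    """Read one sample from /proc/stat."""
    with open(path) as stat_file:
        return parse_cpu_line(stat_file.readline())


def log_shares(interval_s, count):
    """Print a timestamped user/system split `count` times, every `interval_s` seconds."""
    previous = read_sample()
    for _ in range(count):
        time.sleep(interval_s)
        current = read_sample()
        user_pct, system_pct = cpu_shares(previous, current)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} user={user_pct:.1f}% system={system_pct:.1f}%")
        previous = current
```

Running something like `log_shares(10, 360)` on each node during a model
run would leave a timeline showing when the system-time share starts to
climb, which could then be lined up against job start times for the
other model.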
Again, any suggestions on where to go next on this would be welcome. I'm
wondering if I'm seeing some strange kernel-level or MPI-level problem
which only manifests under certain conditions but I can't even guess at
this stage what those conditions might be.
Stephen Mulcahy, Applepie Solutions Ltd., Innovation in Business Center,
GMIT, Dublin Rd, Galway, Ireland. +353.91.751262 http://www.aplpi.com
Registered in Ireland, no. 289353 (5 Woodlands Avenue, Renmore, Galway)