[Beowulf] IB symbol error thresholds for health check scripts ?
samuel at unimelb.edu.au
Mon Jan 10 15:13:27 PST 2011
On 30/12/10 05:29, Stuart Barkley wrote:
> On Mon, 13 Dec 2010 at 17:43 -0000, Christopher Samuel wrote:
>> We run a bunch of health checks on a compute node through Torque
>> and if they fail the node gets knocked offline.
> Can you share these scripts? I'm needing to get something started
> along these lines (torque, Moab, Infiniband, IBM system x, xCAT).
> I'm sure I'll find things needing adaption to our environment.
I'll need to check, but I don't think it'll be a problem.
>> One of the checks we do is to check that there are no symbol errors
>> on the IB link. However, I'm wondering if simply saying a single
>> error is too brutal for this - what do other people do about these ?
> I'm looking at Infiniband problems currently and have been watching
> our SymbolErrorCounter values. I'm told a "small number" of these
> errors are okay. I don't know the definition of "small" or over how
> long a time period.
> Over the last week 24 of our nodes have shown at least two errors.
> Of these 6 nodes are showing over 400 errors (450-30000) and these
> nodes need attention (I've manually downed them until I can get to the
> hardware). The remaining nodes are all < 50 errors, with half of
> those < 10.
Our errors seem to fall into two groups: small counts which might
increase by one or two over a week (or even less), and counts that
climb by hundreds a second. We don't seem to have any (currently) in
between.
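
Given that split, one option is to alarm on the rate of increase between
two samples rather than on the raw count. The following is only a
sketch of that idea, not what our scripts currently do: the function
names and the default threshold of 1 error per second are made up for
illustration and would need tuning for a real fabric.

```python
def symbol_error_rate(prev_count, curr_count, elapsed_seconds):
    """Return symbol errors per second between two counter samples."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    if curr_count >= prev_count:
        delta = curr_count - prev_count
    else:
        # Counter went backwards, so assume it was reset between samples
        # and count only what has accumulated since the reset.
        delta = curr_count
    return delta / elapsed_seconds


def link_is_unhealthy(prev_count, curr_count, elapsed_seconds,
                      max_errors_per_second=1.0):
    """True if the error rate exceeds the (hypothetical) threshold."""
    rate = symbol_error_rate(prev_count, curr_count, elapsed_seconds)
    return rate > max_errors_per_second


if __name__ == "__main__":
    week = 7 * 24 * 3600
    # Two extra errors over a week: tolerated.
    print(link_is_unhealthy(10, 12, week))
    # 45000 errors in one 7.5 minute check interval: flagged.
    print(link_is_unhealthy(0, 45000, 450))
```

With a threshold like this a node that picks up one or two errors a week
stays online, while one gaining hundreds a second is caught at the next
check.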
> I'm planning to do more proactive monitoring of the Infiniband Fabric.
> The current toolset is very awkward to use for monitoring. There is
> an updated Infiniband Fabric Suite from QLogic which appears to
> improve this significantly. It should be possible to do the
> Infiniband monitoring completely off node so as to not perturb the
> computations too much.
We've got Voltaire switches and Mellanox cards and are using the
/sys interface to the OFED drivers (from RHEL5.5) to get these
values.
We do check the switch out with ibcheckerrors but unfortunately
some of what the switch reports doesn't make sense to us so we've
got a query in to IBM (who supplied the Voltaire switch) to find
out what's going on. Still waiting for a response..
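
As a rough illustration of the /sys approach, the sketch below reads the
symbol_error counter for every port it can find. It assumes the
per-port counters live under
/sys/class/infiniband/<device>/ports/<port>/counters/, which is the
layout on the systems I know of; check that it matches your driver
stack. The function name is made up for the example.

```python
import glob
import os


def read_symbol_errors(sysfs_root="/sys/class/infiniband"):
    """Return {(device, port): symbol_error count} for each port found.

    The directory layout is an assumption; an empty dict is returned
    if nothing matches (for example on a node with no IB card).
    """
    counts = {}
    pattern = os.path.join(sysfs_root, "*", "ports", "*",
                           "counters", "symbol_error")
    for path in sorted(glob.glob(pattern)):
        parts = path.split(os.sep)
        # .../<device>/ports/<port>/counters/symbol_error
        device, port = parts[-5], parts[-3]
        with open(path) as counter_file:
            counts[(device, port)] = int(counter_file.read().strip())
    return counts


if __name__ == "__main__":
    for (device, port), errors in read_symbol_errors().items():
        print("%s port %s: %d symbol errors" % (device, port, errors))
```

Since this only reads local files it adds very little load to the node,
though it does not cover errors seen only at the switch end of the link.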
>>  - checks run prior to a job start, after a job exits and every
>> 7.5 minutes (every 10 mom intervals).
> Also when the node comes up before mom starts I assume?
They run on mom start up, so the mom can mark itself
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545