[Beowulf] IB symbol error thresholds for health check scripts ?
cap at nsc.liu.se
Mon Jan 3 02:17:22 PST 2011
On Wednesday, December 29, 2010 07:29:21 pm Stuart Barkley wrote:
> On Mon, 13 Dec 2010 at 17:43 -0000, Christopher Samuel wrote:
> > One of the checks we do is to check that there are no symbol errors
> > on the IB link. However, I'm wondering if simply saying a single
> > error is too brutal for this - what do other people do about these ?
> I'm looking at Infiniband problems currently and have been watching
> our SymbolErrorCounter values. I'm told a "small number" of these
> errors are okay. I don't know the definition of "small" or over how
> long a time period.
My personal take: over roughly a week of data, a two-digit symbol error count
indicates an imperfect link/port (but probably not a real problem). A
three-digit count is a problem; fix it.
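
As a rough illustration, the rule of thumb above could be written as a small
check for a health script. This is only a sketch under assumptions: the
function name, the "ok" label for single-digit counts, and the linear scaling
of a shorter or longer sampling window to one week are all mine, not part of
the rule as stated. How the SymbolErrorCounter value is collected is left out.

```python
def classify_symbol_errors(count, days=7.0):
    """Classify a SymbolErrorCounter delta using the week-based rule of thumb.

    count: symbol errors accumulated over the sampling window.
    days:  length of the sampling window in days (assumed parameter).
    """
    if count < 0 or days <= 0:
        raise ValueError("count must be >= 0 and days must be > 0")

    # Assumption: scale the count linearly to a one-week window.
    weekly = count * 7.0 / days

    if weekly >= 100:
        # Three digits (or more) per week: a problem, fix it.
        return "problem"
    if weekly >= 10:
        # Two digits per week: imperfect link/port, probably not a real problem.
        return "imperfect"
    # Single digits per week: treated as fine here (my assumption).
    return "ok"
```

A health check could then fail the node only on "problem" and merely log
"imperfect", which is less brutal than failing on a single error.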