[Beowulf] Monitoring and reporting Infiniband errors

John Hearns hearnsj at googlemail.com
Thu Jun 19 06:18:19 PDT 2014


Does anyone have good tips on monitoring a cluster for Infiniband errors?

Specifically Mellanox/OpenFabrics on an SGI cluster.

I am thinking of running ibcheckerrors or ibqueryerrors and parsing the
output.
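
One way the parsing idea could look is sketched below in Python. It assumes ibqueryerrors prints one line per port in the form "GUID 0x... port N: [CounterName == value] ...". That layout varies between infiniband-diags versions, so the regular expressions are an assumption to check against real output. The function names and the threshold parameter are hypothetical.

```python
import re
import subprocess

# Assumed line shape (check against your infiniband-diags version), e.g.:
#   GUID 0x0002c9030045f120 port 7: [SymbolErrorCounter == 12] [LinkDownedCounter == 1]
PORT_LINE = re.compile(r"GUID\s+(0x[0-9a-fA-F]+)\s+port\s+(\S+):\s*(.*)")
COUNTER = re.compile(r"\[(\w+)\s*==\s*(\d+)\]")


def parse_ibqueryerrors(text):
    """Return {(guid, port): {counter_name: value}} for each port line found."""
    errors = {}
    for line in text.splitlines():
        match = PORT_LINE.search(line)
        if not match:
            continue  # header lines, summaries and blank lines are skipped
        guid, port, rest = match.groups()
        counters = {name: int(value) for name, value in COUNTER.findall(rest)}
        if counters:
            errors[(guid, port)] = counters
    return errors


def over_threshold(errors, threshold=0):
    """Keep only the counters whose value is strictly above the threshold."""
    flagged = {}
    for key, counters in errors.items():
        bad = {name: value for name, value in counters.items() if value > threshold}
        if bad:
            flagged[key] = bad
    return flagged


def run_ibqueryerrors():
    """Run ibqueryerrors with no arguments and return its standard output."""
    result = subprocess.run(
        ["ibqueryerrors"], capture_output=True, text=True, check=False
    )
    return result.stdout
```

Calling over_threshold(parse_ibqueryerrors(run_ibqueryerrors()), threshold=10) on the head node would give the ports worth alerting on. A small wrapper that exits non-zero when that result is non-empty is one way a monitoring tool could poll it.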

I have Monit (http://mmonit.com/monit/) set up on the cluster head node,
and I find it quite good.

Also, individual nodes could use gmetric to report port errors as a
Ganglia metric. I have the ganglia-alert script set up to send email if
Ganglia values exceed set thresholds.
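
A minimal sketch of the per-node gmetric idea follows, in Python. It assumes the port counters are readable from sysfs under a directory such as /sys/class/infiniband/mlx4_0/ports/1/counters. The device name and port number there are assumptions and will differ per node. The metric prefix and function names are hypothetical; the gmetric options used are --name, --value, --type and --units.

```python
import os
import subprocess

# Counter file names as exposed by the kernel under .../ports/<N>/counters/.
DEFAULT_COUNTERS = ("symbol_error", "link_downed", "port_rcv_errors")


def read_port_counters(counters_dir, names=DEFAULT_COUNTERS):
    """Read integer counter files from counters_dir; unreadable ones are skipped."""
    values = {}
    for name in names:
        path = os.path.join(counters_dir, name)
        try:
            with open(path) as handle:
                values[name] = int(handle.read().strip())
        except (OSError, ValueError):
            continue
    return values


def gmetric_command(metric, value, units="errors"):
    """Build the gmetric argument list for one unsigned integer metric."""
    return [
        "gmetric",
        "--name", metric,
        "--value", str(value),
        "--type", "uint32",
        "--units", units,
    ]


def report(counters_dir, prefix="ib_"):
    """Publish every readable counter in counters_dir as a Ganglia metric."""
    for name, value in read_port_counters(counters_dir).items():
        subprocess.run(gmetric_command(prefix + name, value), check=False)
```

Running report() from cron on each compute node would publish metrics such as ib_symbol_error, which the ganglia-alert thresholds could then watch.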

Any ideas welcomed please.