[Beowulf] Monitoring and reporting Infiniband errors
hearnsj at googlemail.com
Thu Jun 19 06:18:19 PDT 2014
Does anyone have good tips on monitoring a cluster for Infiniband errors?
Specifically Mellanox/OpenFabrics on an SGI cluster.
I am thinking of running ibcheckerrors or ibqueryerrors and parsing the
output.
I have Monit set up on the cluster head node,
which I find quite good.
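
As a rough illustration of the parsing idea, here is a minimal Python sketch.
It assumes that ibqueryerrors reports counters as bracketed
"[CounterName == value]" pairs on lines beginning "GUID <guid> port <n>:".
That line shape is an assumption and should be checked against the
infiniband-diags version installed on the cluster. The function names are
hypothetical.

```python
import re
import subprocess

# Assumed line shape (verify against your ibqueryerrors version):
#   GUID 0x0002c90200423e48 port 1: [SymbolErrorCounter == 12] [LinkDownedCounter == 3]
PORT_LINE = re.compile(r"GUID\s+(0x[0-9a-fA-F]+)\s+port\s+(\w+):(.*)")
COUNTER = re.compile(r"\[(\w+)\s*==\s*(\d+)\]")


def parse_port_errors(text):
    """Return {(guid, port): {counter_name: value}} for lines that list counters."""
    errors = {}
    for line in text.splitlines():
        match = PORT_LINE.search(line)
        if not match:
            continue  # header lines, summaries, blank lines
        guid, port, rest = match.groups()
        counters = {name: int(value) for name, value in COUNTER.findall(rest)}
        if counters:
            errors.setdefault((guid, port), {}).update(counters)
    return errors


def run_ibqueryerrors():
    """Run ibqueryerrors with no options and return whatever it printed."""
    result = subprocess.run(
        ["ibqueryerrors"], capture_output=True, text=True, check=False
    )
    return result.stdout
```

Calling parse_port_errors(run_ibqueryerrors()) from a script that exits
non-zero when the result is non-empty would give Monit something to check
from the head node.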
Also, if individual nodes could use gmetric to report port errors as a
Ganglia metric, I have the ganglia-alert script set up to send email if
Ganglia values exceed set thresholds.
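
A per-node reporter along those lines might look like the sketch below. It
reads the port counters that the Linux kernel exposes under
/sys/class/infiniband and submits each one with gmetric. The choice of
counters (symbol_error, link_downed), the "ib_<device>_p<port>_<counter>"
metric naming, and the function names are all assumptions made for
illustration, not an established convention.

```python
import subprocess
from pathlib import Path

SYSFS_ROOT = Path("/sys/class/infiniband")


def read_port_counters(root=SYSFS_ROOT, names=("symbol_error", "link_downed")):
    """Return {metric_name: value} for every device and port found under root."""
    metrics = {}
    for counters_dir in sorted(root.glob("*/ports/*/counters")):
        device = counters_dir.parts[-4]  # e.g. mlx4_0
        port = counters_dir.parts[-2]    # e.g. 1
        for name in names:
            try:
                value = int((counters_dir / name).read_text().strip())
            except (OSError, ValueError):
                continue  # counter missing or unreadable on this port
            # Hypothetical naming scheme for the Ganglia metric.
            metrics[f"ib_{device}_p{port}_{name}"] = value
    return metrics


def gmetric_command(name, value, units="errors"):
    """Build the gmetric argument list for one counter."""
    return [
        "gmetric",
        "--name", name,
        "--value", str(value),
        "--type", "uint32",
        "--units", units,
    ]


def report(metrics):
    """Submit every metric to Ganglia by invoking gmetric once per counter."""
    for name, value in metrics.items():
        subprocess.run(gmetric_command(name, value), check=False)
```

Running report(read_port_counters()) from cron on each node would publish the
raw counter values. Since these counters are cumulative, a threshold in
ganglia-alert would fire on the total since the last counter reset rather
than on a rate.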
Any ideas welcomed please.