[Beowulf] Monitoring and reporting Infiniband errors

John Hearns hearnsj at googlemail.com
Thu Jun 19 07:10:43 PDT 2014


If anyone is interested, here is my solution, which seems good enough.
Someone will no doubt say there is a neater way!

A shell script which runs ibqueryerrors and returns 1 if anything is found:

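#!/bin/bash
# Check for errors on Infiniband fabric 0 (port 0).
# A separate script does the same for port 1.

# Drop the first line of the ibqueryerrors output; anything left is an error report.
errors=$(/usr/sbin/ibqueryerrors -c -s XmtWait -P0 | tail -n +2)
if [ -n "$errors" ] ; then
   echo "Check for errors on Infiniband Fabric 0"
   echo
   # Quote the variable so the report keeps its line breaks.
   echo "$errors"
   exit 1
else
   exit 0
fi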
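
A possible variation, as a sketch only: since the two scripts differ only in the port number, the check could live in one function that takes the port as a parameter. The name check_fabric is hypothetical (it is not part of infiniband-diags), and the function reads the ibqueryerrors output on stdin so it can be tried without a fabric attached.

```shell
#!/bin/bash
# Sketch: one check for every port instead of one script per port.
# check_fabric is a hypothetical helper name, not an existing tool.

# Read ibqueryerrors output on stdin, drop the first line (as the
# original script does with tail -n +2) and report whatever is left.
check_fabric() {
    local port=$1
    local errors
    errors=$(tail -n +2)
    if [ -n "$errors" ]; then
        echo "Check for errors on Infiniband Fabric $port"
        echo
        echo "$errors"
        return 1
    fi
    return 0
}

# Intended use, with the port number as the script's first argument:
#   /usr/sbin/ibqueryerrors -c -s XmtWait -P "$1" | check_fabric "$1"
#   exit $?
```

The exit status convention stays the same as before (0 for a clean fabric, 1 when anything is reported), so the Monit side would not need to change beyond passing the port number.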

For Monit monitoring, exit 0 means the service is OK, exit 1 means there is
a problem.

So in monit:

check program ib0-errors with path "/usr/local/bin/check-ib0.sh"
   every "30 * * * *"
   if status == 1 then alert
   alert my.email at domain.com with reminder on 30 cycles
   set mail-format { subject: $DESCRIPTION }



(PS: Monit is only returning the first line of the script output; to be revised.)
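
One possible workaround for the first-line problem, offered as an untested idea rather than a fix: have the script print everything on a single line, so that whatever keeps only the first line still sees every entry. The helper name summarize_errors and the " | " separator are my own choices, not anything Monit requires.

```shell
#!/bin/bash
# Sketch: collapse a multi-line error report into one line.
# summarize_errors is a hypothetical helper; it reads the report on stdin
# and prints nothing at all when the report is empty.
summarize_errors() {
    local label=$1
    awk -v label="$label" '
        NF { sub(/^[ \t]+/, ""); lines[++n] = $0 }   # keep non-blank lines, trimmed
        END {
            if (n == 0) exit 0
            printf "%s: %d line(s):", label, n
            for (i = 1; i <= n; i++)
                printf " %s%s", lines[i], (i < n ? " |" : "")
            printf "\n"
        }'
}
```

In the check script this would replace the three echo lines, with the report piped into summarize_errors "Fabric 0". On a large fabric the line could get long, so a count followed by only the first few entries might be more practical.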



On 19 June 2014 14:18, John Hearns <hearnsj at googlemail.com> wrote:

> Does anyone have good tips on monitoring a cluster for Infiniband errors?
>
> Specifically Mellanox/OpenFabrics on an SGI cluster.
>
> I am thinking of running ibcheckerrors or ibqueryerrors and parsing the
> output.
>
> I have Monit set up on the cluster head node
> http://mmonit.com/monit/
>
> which I find quite good
>
> Also, if individual nodes could use gmetric to report port errors as a
> Ganglia metric, I have the ganglia-alert script set up to send email if
> Ganglia values exceed set thresholds.
>
> Any ideas welcomed please.
>
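
The gmetric idea from the original question could be sketched roughly as below. This is a simplification under stated assumptions: it sums one counter across the whole ibqueryerrors report (so it would run on the head node rather than on each compute node), and it assumes the report lists counters as "[Name == value]". The helper names sum_counter and publish_counter and the "ib_" metric prefix are hypothetical.

```shell
#!/bin/bash
# Sketch only: total one error counter from ibqueryerrors-style output
# and publish it as a Ganglia metric with gmetric.
# Assumption: counters appear in the report as "[Name == value]".

# Read the report on stdin and print the sum of the named counter.
sum_counter() {
    local name=$1
    awk -v name="$name" '
        {
            for (i = 1; i + 2 <= NF; i++) {
                if ($i == "[" name && $(i + 1) == "==") {
                    value = $(i + 2)
                    sub(/\]$/, "", value)   # strip the closing bracket
                    total += value
                }
            }
        }
        END { print total + 0 }'
}

# Send the value to Ganglia; the "ib_" prefix is an arbitrary choice.
publish_counter() {
    local name=$1 value=$2
    gmetric --name "ib_$name" --value "$value" --type uint32 --units errors
}

# Intended use:
#   total=$(/usr/sbin/ibqueryerrors -c -s XmtWait -P0 | sum_counter SymbolErrorCounter)
#   publish_counter SymbolErrorCounter "$total"
```

With the metric in Ganglia, the existing ganglia-alert thresholds could then be applied to it like any other value.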