[Beowulf] Monitoring and reporting Infiniband errors

John Hearns hearnsj at googlemail.com
Thu Jun 19 07:14:38 PDT 2014


pps. I could clear the errors every time this runs, but I have decided to
do a single initial clear of the errors and then watch the cumulative
rate.
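
For what it's worth, here is a rough sketch of one way to work with cumulative counters: keep the previous report and only flag a problem when the report has changed since the last run. The state file path and the function name are my own placeholders, not part of the script quoted below.

```shell
#!/bin/bash
# Sketch: flag a problem only when the cumulative error report has changed
# since the previous run. STATE_FILE is a hypothetical location.
STATE_FILE="${STATE_FILE:-/var/tmp/ib0-errors.last}"

# check_unchanged: reads the current report on stdin, stores it, and
# returns 0 if it matches the previously stored report, 1 if it differs.
check_unchanged() {
    local current previous=""
    current=$(cat)
    if [ -f "$STATE_FILE" ]; then
        previous=$(cat "$STATE_FILE")
    fi
    printf '%s\n' "$current" > "$STATE_FILE"
    if [ "$current" != "$previous" ]; then
        return 1
    fi
    return 0
}

# Example use, assuming the same ibqueryerrors invocation as in the thread:
# /usr/sbin/ibqueryerrors -c -s XmtWait -P0 | tail -n +2 | check_unchanged || exit 1
```

With this, a port whose counters are non-zero but stable stays quiet, and only a change in the report produces exit status 1.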

ppps. there is a better list for this chatter, isn't there...


On 19 June 2014 15:10, John Hearns <hearnsj at googlemail.com> wrote:

> If anyone is interested, here is my solution, which seems good enough.
> Someone will no doubt say there is a neater way!
>
> A shell script which runs ibqueryerrors and returns 1 if anything is found:
>
> #!/bin/bash
> # check for errors on the Infiniband fabric 0
> # another script runs for port 1
>
> errors=$(/usr/sbin/ibqueryerrors -c -s XmtWait -P0 | tail -n +2)
> if [ -n "$errors" ] ; then
>    echo "Check for errors on Infiniband Fabric 0"
>    echo
>    # quote the variable so the report keeps its line breaks
>    echo "$errors"
>    exit 1
> else
>    exit 0
> fi
>
> For Monit monitoring, exit 0 means the service is OK, exit 1 means there
> is a problem.
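
A possible variation, sketched here rather than tested on a real fabric: one function that takes the port as an argument, so a single script can serve both ports. The `check_fabric` name and the overridable `IBQUERYERRORS` variable are assumptions of mine; the ibqueryerrors options are the ones used in the script above.

```shell
#!/bin/bash
# Sketch: one check function for any port, instead of one script per port.
# IBQUERYERRORS can be overridden, which makes it possible to try the
# function on a machine without a fabric.
IBQUERYERRORS="${IBQUERYERRORS:-/usr/sbin/ibqueryerrors}"

# check_fabric PORT: print a report and return 1 if anything is found,
# otherwise print nothing and return 0 (the convention Monit expects).
check_fabric() {
    local port="$1" errors
    # Same options as the original script; tail drops the first output line.
    errors=$("$IBQUERYERRORS" -c -s XmtWait -P "$port" | tail -n +2)
    if [ -n "$errors" ]; then
        echo "Check for errors on Infiniband Fabric $port"
        echo
        echo "$errors"
        return 1
    fi
    return 0
}

# Example use as the body of a script called with the port number:
# check_fabric "${1:-0}"; exit $?
```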
>
> So in monit:
>
> check program ib0-errors with path "/usr/local/bin/check-ib0.sh"
>    every "30 * * * *"
>    if status == 1 then alert
>    alert my.email@domain.com with reminder on 30 cycles
>    set mail-format { subject: $DESCRIPTION }
>
>
>
> (ps. monit is only returning the first line - to be revised)
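
One possible workaround, assuming the limit is on the number of lines shown rather than on their length: fold the report into a single line before printing it. The `join_lines` helper and the " | " separator are arbitrary choices of mine.

```shell
#!/bin/bash
# Sketch: fold a multi-line report into one line, so that a consumer which
# shows only the first line of output still shows every entry.
join_lines() {
    # Print " | " before every line except the first, then one final newline.
    awk 'NR > 1 { printf " | " } { printf "%s", $0 } END { if (NR) print "" }'
}

# Example use inside the check script:
# echo "$errors" | join_lines
```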
>
>
>
> On 19 June 2014 14:18, John Hearns <hearnsj at googlemail.com> wrote:
>
>> Does anyone have good tips on monitoring a cluster for Infiniband errors?
>>
>> Specifically Mellanox/OpenFabrics on an SGI cluster.
>>
>> I am thinking of running ibcheckerrors or ibqueryerrors and parsing the
>> output.
>>
>> I have Monit (http://mmonit.com/monit/) set up on the cluster head node,
>> which I find quite good.
>>
>> Also, if individual nodes could use gmetric to report port errors as a
>> Ganglia metric, I already have the ganglia-alert script set up to send
>> email when Ganglia values exceed set thresholds.
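
A minimal sketch of the gmetric idea, under some assumptions: it publishes the number of lines in the ibqueryerrors report, not the individual counter values, and the metric name `ib_error_lines` is made up for illustration. The ibqueryerrors options are the ones used elsewhere in this thread; both commands are overridable so the functions can be tried without a fabric or a Ganglia daemon.

```shell
#!/bin/bash
# Sketch: publish the size of the error report as a Ganglia metric.
IBQUERYERRORS="${IBQUERYERRORS:-/usr/sbin/ibqueryerrors}"
GMETRIC="${GMETRIC:-gmetric}"

# count_error_lines: count the non-empty lines read from stdin.
count_error_lines() {
    awk 'NF { n++ } END { print n + 0 }'
}

# report_ib_errors: run the query, count the report lines after the first
# one, and hand the count to gmetric as an unsigned integer.
report_ib_errors() {
    local count
    count=$("$IBQUERYERRORS" -c -s XmtWait -P0 | tail -n +2 | count_error_lines)
    "$GMETRIC" --name ib_error_lines --value "$count" --type uint32 --units lines
}

# Example use from cron:
# report_ib_errors
```

A threshold of zero on that metric in ganglia-alert would then play the same role as the exit status does for Monit.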
>>
>> Any ideas welcomed please.
>>
>
>