[Beowulf] IPoIB failure
bill at princeton.edu
Fri Jan 23 05:39:34 PST 2015
We had a strange event last night. Our IB fabric started demonstrating
some odd routing behavior over IB.
host A could ping both B and C, yet B and C could not ping one another.
This was only at the IP layer. ibping tests all worked fine. A few
runs of ibdiagnet produced all the switches and hosts we expected to find.
As we rebooted hosts with non-connectivity, they came up find but then
host A could reach neither one. After a number of host reboots we soon
realized that we were playing whack-a-mole as the problem resurfaced
sometimes on the original hosts and sometimes on a new host.
In the end we rebooted every Mellanox switch. The big core switch. The
half rack switch. The top of rack switches. And sure enough,
everything came back fine without any more reboots.
At this point all I know is that the server running our master opensm
rebooted and it took a few hours before these problems started, first
indicated by stale filesystem errors across the GPFS mounts.
Obviously, rebooting every dang switch is not the correct answer here.
But at this point I don't have a better solution if it occurs again. Or
even an answer as to WHY it happened in the first place. It just seems
that the IPoIB layer was at fault here somehow in that routing was not
correct across the entire IB network.
If anyone has any insights, I'd be most appreciative. It's clear we do
not understand this aspect of the IB stack and how this layer works.
More information about the Beowulf