[Beowulf] IPoIB failure
cap at nsc.liu.se
Wed Jan 28 02:51:16 PST 2015
On Wed, 28 Jan 2015 09:24:39 +1100
Christopher Samuel <samuel at unimelb.edu.au> wrote:
> On 24/01/15 01:29, Lennart Karlsson wrote:
> > This reminds me of when we upgraded to SL-6.6 (approximately the
> > same as CentOS-6.6 and RHEL-6.6).
> > The new kernel we got could not handle our IPoIB storage
> > traffic, which broke down within a few hours.
> Interesting, we use GPFS over IPoIB and upgraded to RHEL 6.6 in early
> November and haven't seen any issues at all (and with a lot of
> bioinformatics users we'd notice problems pretty quickly).
Red Hat has confirmed that there are multiple issues with ipoib in 6.6
and there is a thread for testing fixes at:
[PATCH V3 FIX For-3.19 0/3] IB/ipoib: Fix multicast join flow
The problem is most easily demonstrated by restarting the SM and then
bringing up new ipoib interfaces on 6.6 hosts. This creates islands of
We are currently running a 6.6 kernel with the entire ulp/ipoib
directory reverted to 6.5.
> Is your IB running in connected mode or datagram mode?
> We're in connected mode everywhere because of our BG/Q.
> All the best,
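
For anyone wanting to answer the connected-versus-datagram question on
their own nodes, here is a minimal Python sketch. It assumes the Linux
IPoIB driver exposes the mode in /sys/class/net/<interface>/mode; the
interface name ib0 and the helper name ipoib_mode are illustrative and
not taken from this thread.

```python
from pathlib import Path
from typing import Optional


def ipoib_mode(iface: str = "ib0",
               sysfs_root: str = "/sys/class/net") -> Optional[str]:
    """Return the IPoIB mode string for iface, or None if it cannot be read.

    Assumption: the IPoIB driver publishes "connected" or "datagram" in
    <sysfs_root>/<iface>/mode. The default interface name is hypothetical.
    """
    mode_file = Path(sysfs_root) / iface / "mode"
    try:
        # The file holds a single word followed by a newline.
        return mode_file.read_text().strip()
    except OSError:
        # Interface missing, not an IPoIB device, or file not readable.
        return None


if __name__ == "__main__":
    print(ipoib_mode("ib0"))
```

Running it on each host (or looping over the interface names) gives a
quick inventory of which nodes are in which mode before comparing
behaviour across kernels.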