[Beowulf] IPoIB failure

Peter Kjellström cap at nsc.liu.se
Wed Jan 28 02:51:16 PST 2015


On Wed, 28 Jan 2015 09:24:39 +1100
Christopher Samuel <samuel at unimelb.edu.au> wrote:

> On 24/01/15 01:29, Lennart Karlsson wrote:
> 
> > This reminds me of when we upgraded to SL-6.6 (approximately the
> > same as CentOS-6.6 and RHEL-6.6).
> > 
> > The new kernel we got could not handle our IPoIB storage
> > traffic, which broke down within a few hours.
> 
> Interesting, we use GPFS over IPoIB and upgraded to RHEL 6.6 in early
> November and haven't seen any issues at all (and with a lot of
> bioinformatics users we'd notice problems pretty quickly).

Red Hat has confirmed that there are multiple issues with IPoIB in 6.6,
and there is a thread for testing fixes at:

  [PATCH V3 FIX For-3.19 0/3] IB/ipoib: Fix multicast join flow
  https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg22511.html

The problem is most easily demonstrated by restarting the subnet manager
(SM) and then bringing up new IPoIB interfaces on 6.6 hosts. The result
is islands of connectivity: groups of hosts that can reach each other
but not hosts outside their group.
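
To make "islands of connectivity" concrete, here is a minimal sketch of
how one might group hosts into islands from a list of host pairs that
can ping each other over IPoIB. The host names and the pair list are
hypothetical, and the grouping assumes reachability is symmetric; it is
an illustration, not the procedure we used. A healthy fabric should
yield exactly one island.

```python
from collections import defaultdict


def connectivity_islands(hosts, reachable_pairs):
    """Group hosts into islands (connected components of the ping graph)."""
    # Build an undirected adjacency map; assumes a<->b reachability is symmetric.
    neighbours = defaultdict(set)
    for a, b in reachable_pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    islands = []
    seen = set()
    for host in hosts:
        if host in seen:
            continue
        # Depth-first walk collecting every host linked to this one.
        island = set()
        stack = [host]
        while stack:
            current = stack.pop()
            if current in island:
                continue
            island.add(current)
            stack.extend(neighbours[current] - island)
        seen |= island
        islands.append(sorted(island))
    return islands


# Hypothetical ping results gathered after an SM restart.
hosts = ["n1", "n2", "n3", "n4"]
pairs = [("n1", "n2"), ("n3", "n4")]
print(connectivity_islands(hosts, pairs))
```

More than one group in the result means the IPoIB network has split.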

We are currently running a 6.6 kernel with the entire ulp/ipoib
directory reverted to its 6.5 state.

/Peter

> Is your IB running in connected mode or datagram mode?
> 
> We're in connected mode everywhere because of our BG/Q.
> 
> All the best,
> Chris
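
For anyone who wants to check the connected versus datagram question on
their own nodes: the IPoIB driver exposes the current mode in
/sys/class/net/<interface>/mode. A minimal sketch that reads it follows.
The interface name ib0 is an assumption, so adjust it to your setup; the
sysfs_root parameter is a hypothetical convenience for testing.

```python
from pathlib import Path


def ipoib_mode(iface="ib0", sysfs_root="/sys/class/net"):
    """Return 'connected' or 'datagram' for an IPoIB interface, or None.

    None is returned when the mode file cannot be read, for example when
    the interface does not exist or is not an IPoIB interface.
    """
    mode_file = Path(sysfs_root) / iface / "mode"
    try:
        return mode_file.read_text().strip()
    except OSError:
        return None


# "ib0" is an assumed interface name; prints None if it is absent.
print(ipoib_mode("ib0"))
```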
