[Beowulf] Infiniband PortXmitWait problems on IBM Sandybridge iDataplex with Mellanox ConnectX-3

Joseph Landman landman at scalableinformatics.com
Wed Jun 12 04:18:37 PDT 2013


Bad FDR cables?  Is it possible that the switches are running slower
due to signaling issues?


On Jun 12, 2013, at 12:03 AM, Christopher Samuel <samuel at unimelb.edu.au> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi folks,
>
> I'm doing the bring up and testing on our SandyBridge IBM iDataplex
> with an FDR switch and as part of that I've been doing burn-in testing
> with HPL and seeing really poor efficiency (~25% over 65 odd nodes
> with 256GB RAM).  Simultaneously HPL on the 3 nodes with 512GB RAM
> gives ~70% efficiency.
>
> Checking the switch with ibqueryerrors shows lots of things like:
>
>   GUID 0x2c90300771450 port 22: [PortXmitWait == 198817026]
>
> That's about 2 or 3 hours after last clearing the counters. :-(
>
> Doing:
>
> # ibclearcounters && ibclearerrors && sleep 1 && ibqueryerrors
>
> Shows 75 of 94 nodes bad, pretty much all with thousands of
> PortXmitWait counts, some into the tens of thousands.
>
> We are running RHEL 6.3, Mellanox OFED 2.0.5, FDR IB and Open-MPI 1.6.4.
>
> Talking with another site that has the same sort of iDataplex, but
> running RHEL 5.8, Mellanox OFED 1.5 and QDR IB, reveals that they (once
> they started looking) are also seeing high PortXmitWait counters
> under user codes shortly after clearing them.
>
> These are Mellanox MT27500 ConnectX-3 adapters.
>
> We're talking with both IBM and Mellanox directly, but other than
> Mellanox spotting some GPFS NSD file servers that had bad FDR ports
> (which got unplugged last week and fixed today) we've not made any
> progress into the underlying cause. :-(
>
> Has anyone seen anything like this before?
>
> cheers!
> Chris
> - --
> Christopher Samuel        Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/      http://twitter.com/vlsci
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.11 (GNU/Linux)
> Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
>
> iEYEARECAAYFAlG4AQ8ACgkQO2KABBYQAh96awCfRESpDRhVHvpJBqrv33sGlQJm
> NvoAnjg20/xMMcji72eAWI1HzyEQureY
> =GfkH
> -----END PGP SIGNATURE-----
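
For anyone wanting to rank the affected ports rather than eyeball the raw
output, here is a minimal Python sketch that pulls PortXmitWait values out
of lines shaped like the ibqueryerrors line quoted above. It assumes every
report line follows that one quoted format (GUID, port, then bracketed
"name == value" pairs). The function names and the second sample line are
made up for illustration and are not from the original post.

```python
import re

# Matches lines shaped like the one quoted in the thread:
#   GUID 0x2c90300771450 port 22: [PortXmitWait == 198817026]
# Assumption: all report lines of interest follow this layout.
LINE_RE = re.compile(r"GUID\s+(0x[0-9a-fA-F]+)\s+port\s+(\d+):\s*(.*)")
COUNTER_RE = re.compile(r"\[(\w+) == (\d+)\]")


def parse_xmit_wait(text):
    """Return {(guid, port): PortXmitWait value} for every matching line."""
    result = {}
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue  # skip headers, summaries, and blank lines
        guid, port, rest = match.group(1), int(match.group(2)), match.group(3)
        # A line may carry several bracketed counters; keep only PortXmitWait.
        for name, value in COUNTER_RE.findall(rest):
            if name == "PortXmitWait":
                result[(guid, port)] = int(value)
    return result


def worst_ports(counters, limit=10):
    """Return up to `limit` (key, value) pairs, highest PortXmitWait first."""
    ranked = sorted(counters.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# First line is the one quoted in the thread; the second is hypothetical,
# added only to show a line carrying more than one counter.
SAMPLE = """\
  GUID 0x2c90300771450 port 22: [PortXmitWait == 198817026]
  GUID 0x2c90300771450 port 7: [PortXmitWait == 12345] [SymbolErrorCounter == 3]
"""

if __name__ == "__main__":
    for (guid, port), value in worst_ports(parse_xmit_wait(SAMPLE)):
        print(f"{guid} port {port}: PortXmitWait={value}")
```

Feeding it the saved output of ibqueryerrors, taken a fixed interval after
clearing the counters, gives a short list of ports to check first, for
example for bad cables.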


