[Beowulf] Killing nodes with Open-MPI?

Lance Wilson lance.wilson at monash.edu
Thu Oct 26 14:58:02 PDT 2017


Hi Chris,
We are running ConnectX-4 (CX4) cards and have had some issues as well.
Which version(s) of Open MPI are they running?

If you follow the instructions from Mellanox and run with the yalla PML and
MXM, that works(ish) on Open MPI 1.10.3, provided you also set the
appropriate environment variables or config file entries.
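
As a rough sketch of the environment-variable route (assuming an Open MPI
1.10.x build with MXM support compiled in; the device and port name below
is a placeholder, not something from our setup), the selection could look
like this:

```shell
# Assumption: Open MPI 1.10.x was built with MXM support.
# Select the yalla PML, which talks to MXM directly instead of the openib BTL.
export OMPI_MCA_pml=yalla

# Hypothetical device:port pair for MXM; list the real ones with ibv_devinfo.
export MXM_RDMA_PORTS=mlx5_0:1

# Show what Open MPI and MXM will pick up from the environment.
env | grep -E '^(OMPI_MCA_|MXM_)' | sort
```

The same MCA setting can instead go in $HOME/.openmpi/mca-params.conf as
the line "pml = yalla", which avoids exporting it in every job script.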

If they are running the Open MPI 2.1 series, there are some issues with
compiling in the Mellanox drivers.

We haven't seen any hard locks like this, but we have seen a whole bundle
of other issues.

Cheers,

Lance
--
Dr Lance Wilson
Characterisation Virtual Laboratory (CVL) Coordinator &
Senior HPC Consultant
Ph: 03 99055942 (+61 3 99055942)
Mobile: 0437414123 (+61 4 3741 4123)
Multi-modal Australian ScienceS Imaging and Visualisation Environment
(www.massive.org.au)
Monash University

On 26 October 2017 at 22:42, Chris Samuel <samuel at unimelb.edu.au> wrote:

> Hi folks,
>
> I'm helping another group out and we've found that running an Open-MPI
> program, even just a singleton, will kill nodes with Mellanox ConnectX 4
> and 5 cards using RoCE (the mlx5 driver). The node just locks up hard
> with no OOPS or other diagnostics and has to be power cycled.
>
> Disabling openib/verbs support with:
>
> export OMPI_MCA_btl=tcp,self,vader
>
> stops the crashes, and whilst it's hard to tell, strace seems to imply it
> hangs when trying to probe for openib/verbs devices (or shortly after).
>
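
A possible variation on that workaround, sketched here on the assumption
that only the openib BTL needs to be avoided: Open MPI also accepts an
exclusion list, where a leading caret negates the list, so the remaining
BTLs are still chosen automatically. The program name in the commented
mpirun line is a placeholder.

```shell
# Exclude only the openib BTL instead of listing the allowed ones.
# The caret applies to the whole list; include and exclude forms cannot be mixed.
export OMPI_MCA_btl='^openib'

# Per-run equivalent on the command line (./a.out is a placeholder):
# mpirun --mca btl '^openib' -np 1 ./a.out

echo "btl selection: ${OMPI_MCA_btl}"
```
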
> Nodes with ConnectX-3 cards (mlx4 driver) don't seem to have the issue
> and I'm reasonably convinced this has to be a driver bug, or perhaps a
> bad interaction with recent 4.11.x and 4.12.x kernels (they need those
> for CephFS).
>
> They've got a bug open with Mellanox already but I was wondering if anyone
> else had seen anything similar?
>
> cheers!
> Chris
> --
>  Christopher Samuel        Senior Systems Administrator
>  Melbourne Bioinformatics - The University of Melbourne
>  Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
>