[Beowulf] Killing nodes with Open-MPI?

Chris Samuel samuel at unimelb.edu.au
Thu Oct 26 04:42:51 PDT 2017


Hi folks,

I'm helping another group out and we've found that running an Open-MPI 
program, even just a singleton, will kill nodes that have Mellanox 
ConnectX-4 and ConnectX-5 cards using RoCE (the mlx5 driver). The node 
just locks up hard with no oops or other diagnostics and has to be power 
cycled.

Disabling openib/verbs support with:

export OMPI_MCA_btl=tcp,self,vader

stops the crashes, and whilst it's hard to be certain, strace seems to 
imply it hangs while probing for openib/verbs devices (or shortly after).
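
For reference, the same BTL selection can be passed on the mpirun command 
line rather than via the environment, and the probe can be traced with 
something like the following (the program name here is just a placeholder):

# equivalent to the export above
mpirun --mca btl tcp,self,vader -np 1 ./mpi_hello

# follow forks and log syscalls to see where the openib probe stalls
strace -f -o /tmp/ompi-trace.log ./mpi_hello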

Nodes with ConnectX-3 cards (the mlx4 driver) don't seem to have the issue, 
and I'm reasonably convinced this has to be a driver bug, or perhaps a bad 
interaction with the recent 4.11.x and 4.12.x kernels (they need those for 
CephFS).
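
In case anyone wants to compare setups, the kernel and mlx5 driver/firmware 
details can be pulled with something along these lines:

uname -r                                          # running kernel (4.11.x / 4.12.x here)
modinfo mlx5_core | grep -iE 'version|vermagic'   # driver version details
ibv_devinfo | grep -E 'hca_id|fw_ver'             # device and firmware revision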

They've got a bug open with Mellanox already, but I was wondering if anyone 
else had seen anything similar?

cheers!
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545


