[Beowulf] Killing nodes with Open-MPI?

Christopher Samuel samuel at unimelb.edu.au
Sun Nov 5 15:52:06 PST 2017


On 26/10/17 22:42, Chris Samuel wrote:

> I'm helping another group out and we've found that running an Open-MPI
> program, even just a singleton, will kill nodes with Mellanox ConnectX 4 and 5
> cards using RoCE (the mlx5 driver). The node just locks up hard with no OOPS
> or other diagnostics and has to be power cycled.

It was indeed a driver bug, and is now fixed in Mellanox OFED 4.2 (which
came out a few days ago).

cheers,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545


