[Beowulf] Killing nodes with Open-MPI?
samuel at unimelb.edu.au
Sun Nov 5 15:52:06 PST 2017
On 26/10/17 22:42, Chris Samuel wrote:
> I'm helping another group out and we've found that running an Open-MPI
> program, even just a singleton, will kill nodes with Mellanox ConnectX 4 and 5
> cards using RoCE (the mlx5 driver). The node just locks up hard with no OOPS
> or other diagnostics and has to be power cycled.
It was indeed a driver bug, and is now fixed in Mellanox OFED 4.2 (which
came out a few days ago).
Christopher Samuel, Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545