<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body dir="auto">
Where is this driver from: the OS, OFED, or somewhere else?
<div><br>
</div>
<div>We primarily use MVAPICH2, but I would be curious to try to reproduce this on our mlx5 equipment.</div>
<div><br>
</div>
<div>What model of cards do you have?<br>
<br>
<div id="AppleMailSignature"><span style="background-color: rgba(255, 255, 255, 0);">--<br>
____<br>
|| \\UTGERS,       |---------------------------*O*---------------------------<br>
||_// the State     |         Ryan Novosielski - <a href="mailto:novosirj@rutgers.edu" dir="ltr" x-apple-data-detectors="true" x-apple-data-detectors-type="link" x-apple-data-detectors-result="1">novosirj@rutgers.edu</a><br>
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus<br>
||  \\    of NJ     | Office of Advanced Research Computing - MSB C630, Newark<br>
    `'</span></div>
<div><br>
On Oct 26, 2017, at 07:43, Chris Samuel &lt;<a href="mailto:samuel@unimelb.edu.au">samuel@unimelb.edu.au</a>&gt; wrote:<br>
<br>
</div>
<blockquote type="cite">
<div><span>Hi folks,</span><br>
<span></span><br>
<span>I'm helping another group out and we've found that running an Open-MPI </span>
<br>
<span>program, even just a singleton, will kill nodes with Mellanox ConnectX-4 and ConnectX-5
</span><br>
<span>cards using RoCE (the mlx5 driver).   The node just locks up hard with no OOPS
</span><br>
<span>or other diagnostics and has to be power cycled.</span><br>
<span></span><br>
<span>Disabling openib/verbs support with:</span><br>
<span></span><br>
<span>export OMPI_MCA_btl=tcp,self,vader</span><br>
<span></span><br>
<span>stops the crashes, and whilst it's hard to tell, strace seems to imply it hangs
</span><br>
<span>when trying to probe for openib/verbs devices (or shortly after).</span><br>
<span></span><br>
<span>Nodes with ConnectX-3 cards (mlx4 driver) don't seem to have the issue and I'm
</span><br>
<span>reasonably convinced this has to be a driver bug, or perhaps a bad interaction
</span><br>
<span>with recent 4.11.x and 4.12.x kernels (they need those for CephFS).</span><br>
<span></span><br>
<span>They've got a bug open with Mellanox already but I was wondering if anyone </span>
<br>
<span>else had seen anything similar?</span><br>
<span></span><br>
<span>cheers!</span><br>
<span>Chris</span><br>
<span>-- </span><br>
<span>Christopher Samuel        Senior Systems Administrator</span><br>
<span>Melbourne Bioinformatics - The University of Melbourne</span><br>
<span>Email: <a href="mailto:samuel@unimelb.edu.au">samuel@unimelb.edu.au</a> Phone: +61 (0)3 903 55545</span><br>
<span></span><br>
<span>_______________________________________________</span><br>
<span>Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org">Beowulf@beowulf.org</a> sponsored by Penguin Computing</span><br>
<span>To change your subscription (digest mode or unsubscribe) visit <a href="http://www.beowulf.org/mailman/listinfo/beowulf">
http://www.beowulf.org/mailman/listinfo/beowulf</a></span><br>
</div>
</blockquote>
</div>
</body>
</html>