[Beowulf] Defective Mellanox EDR Switches

Ryan Novosielski novosirj at rutgers.edu
Thu Jun 7 14:41:05 PDT 2018


> On Jun 7, 2018, at 1:04 PM, Ryan Novosielski <novosirj at rutgers.edu> wrote:
> 
>> On Jun 7, 2018, at 9:43 AM, Peter Kjellström <cap at nsc.liu.se> wrote:
>> 
>> On Thu, 7 Jun 2018 03:12:43 +0000
>> Ryan Novosielski <novosirj at rutgers.edu> wrote:
>> 
>>> One slight correction: 100% of our switches with FRU PN 00WE097/PN
>>> 00WE096Y manufactured on 2016-11-28 (quantity 3) have failed, and one
>>> same FRU PN/PN manufactured on 2016-12-15 too. We have another switch
>>> with FRU PN 00WE093/PN 00WE092Y that was manufactured on 2016-11-28
>>> that has so far been OK, but I’m now suspicious of it.
>> 
>> Thanks for the heads up.
>> 
>> To make this data point more valuable, can you add total numbers? That
>> is, how many (similar) switches in total, how many bad/good. And for
>> how long did they run before exhibiting the problem.
> 
> Sure, Peter.
> 
> We only have 6 SB7890 switches currently. All were purchased through Lenovo, and all have Lenovo machine types of 0724-HD6. I don’t think this has much to do with Lenovo, though, apart from reselling them. One of the four that failed is actually a replacement for a physically damaged switch (bad port latch), so that means there is even bad replacement inventory out there. All of the 4 aforementioned 00WE096Y switches have failed, 3 manufactured on 2016-11-28 and 1 manufactured on 2016-12-15. I don’t have an exact date for the first failure or the switch installation, but the failure occurred roughly a year after the manufacturing date. My guess is that they were in service for about 9-10 months, but I can probably narrow that down with a little more effort if it matters.
> 
> The other two are FRU PN 00WE093/PN 00WE092Y (Lenovo MT 0724-HD5). So far so good on those, though I’m now suspicious of the one manufactured on 2016-11-28.
> 
> Additionally, we have two SB7800 switches — FRU PN: 00WE085/PN 00WE084Y. Too new to tell on those — only a few weeks in service. Both were manufactured on 2018-01-08.

Upshot is an advance RMA on anything that has already shown symptoms (so 3 switches for us), and an on-site visit with a software fix to be applied to all other SB7800-class switches; they seem to think all are potentially affected. The sort of timeframe they gave is ~45 minutes of work on our 11 units.

Fun.

--
____
|| \\UTGERS,  	 |---------------------------*O*---------------------------
||_// the State	 |         Ryan Novosielski - novosirj at rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ	 | Office of Advanced Research Computing - MSB C630, Newark
     `'
