[Beowulf] Defective Mellanox EDR Switches

Kilian Cavalotti kilian.cavalotti.work at gmail.com
Thu Jun 7 22:14:27 PDT 2018


Although I don't have the specifics at hand right now, I can confirm that
we've observed the same thing in our installation as well: a couple of SB7890
switches exhibited the same symptoms after about one year in
production.

We've also seen one SB7800 (i.e., managed) fail in the same way, and one
SB7890 that would revert all of its port links to FDR after a few hours.

As a comparison point, we had zero failures on a 48-switch FDR fabric that
has been in production for 4 years.

Cheers,
-- 
Kilian


On Thu, Jun 7, 2018, 23:41 Ryan Novosielski <novosirj at rutgers.edu> wrote:

> > On Jun 7, 2018, at 1:04 PM, Ryan Novosielski <novosirj at rutgers.edu> wrote:
> >
> >> On Jun 7, 2018, at 9:43 AM, Peter Kjellström <cap at nsc.liu.se> wrote:
> >>
> >> On Thu, 7 Jun 2018 03:12:43 +0000
> >> Ryan Novosielski <novosirj at rutgers.edu> wrote:
> >>
> >>> One slight correction: 100% of our switches with FRU PN 00WE097/PN
> >>> 00WE096Y manufactured on 2016-11-28 (quantity 3) have failed, and one
> >>> same FRU PN/PN manufactured on 2016-12-15 too. We have another switch
> >>> with FRU PN 00WE093/PN 00WE092Y that was manufactured on 2016-11-28
> >>> that has so far been OK, but I’m now suspicious of it.
> >>
> >> Thanks for the heads up.
> >>
> >> To make this data point more valuable, can you add total numbers? That
> >> is, how many (similar) switches in total, how many bad/good. And for
> >> how long did they run before exhibiting the problem.
> >
> > Sure, Peter.
> >
> > We only have 6 SB7890 switches currently. All were purchased through
> > Lenovo, and all have Lenovo machine types of 0724-HD6. I don’t think
> > this has much to do with Lenovo, though, apart from reselling them. One
> > of the four that failed is actually a replacement for a physically
> > damaged switch (bad port latch), so that means there is even bad
> > replacement inventory out there. All of the 4 aforementioned 00WE096Y
> > switches have failed, 3 manufactured on 2016-11-28 and 1 manufactured on
> > 2016-12-15. I don’t have an exact date for the first failure or the
> > switch installation, but the failure occurred roughly a year after the
> > manufacturing date. My guess is that they were in service for about 9-10
> > months, but I can probably narrow that down with a little more effort if
> > it matters.
> >
> > The other two are FRU PN 00WE093/PN 00WE092Y (Lenovo MT 0724-HD5). So
> > far so good on those, though I’m now suspicious of the one manufactured
> > on 2016-11-28.
> >
> > Additionally, we have two SB7800 switches, FRU PN 00WE085/PN 00WE084Y.
> > Too new to tell on those; only a few weeks in service. Both were
> > manufactured on 2018-01-08.
>
> Upshot is an advance RMA on anything that has already shown symptoms (so 3
> switches for us), and an on-site visit with a software fix to be applied to
> all other SB7800-class switches; they seem to think all are potentially
> affected. The sort of timeframe they gave is ~45 minutes of work on our 11
> units.
>
> Fun.
>
> --
> ____
> || \\UTGERS,     |---------------------------*O*---------------------------
> ||_// the State  |         Ryan Novosielski - novosirj at rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
>      `'
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>