Myrinet hardware reliability

Gerry Creager N5JXS gerry.creager at tamu.edu
Sat Feb 8 06:20:11 PST 2003


Have you seen any indications of power supply problems on the 
problematic cluster?

Gerry

Victoria Pennington wrote:
> Hi,
> 
> We have a 113-node IBM x330 cluster with Myrinet 2000.  We're
> experiencing very high failure rates on Myrinet switch ports
> (an average of 3 per month) and on Myrinet NICs to a lesser extent
> (about 1 per month).  Ports and NICs are fine one minute,
> then one or the other just dies (for good).  Cables
> (fibre, not copper) seem fine: only one or two failures in
> nearly a year.
> 
> There is no pattern in the failures, and they are entirely
> unrelated to usage levels; seldom used nodes are just as
> likely to have failures as heavily used nodes.
> 
> We have another small IBM cluster with Myrinet 2000
> (16-port switch with copper cables), and this has run solidly
> for nearly 2 years with not one Myrinet hardware fault.
> 
> I'd be really interested to know of others' experiences with
> Myrinet kit, especially in larger clusters.
> 
> Thanks
> Victoria
> ---
> Dr Victoria Pennington
> Manchester Computing, Kilburn Building,
> University of Manchester,
> Oxford Road, Manchester M13 9PL
> tel. 0161 275 6830, email: v.pennington at man.ac.uk
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
