[Beowulf] Repeated Dell SC1435 crash / hang. How to get the vendor to resolve the issue when 20% of the servers fail in first year?


Mark Hahn hahn at mcmaster.ca
Mon Apr 6 10:37:11 PDT 2009


> I put these machines into production in Aug '08. Within a month we had
> the first machine go bad. They hang with an amber LED and the

what's the term of the warranty?

> logging-module clearly logs an error of the sort: "Voltage sensor
> (VCORE) critical error. State asserted CPU2". Machine needs a
> power-cycle physically from back-plane to restart

well, I think it's worth asking whether you're sure your power feed
is in good shape.

> Do others face similar vendor issues? If 6 out of 23 machines go bad
> within 8 months of an order can I expect the vendor to exchange the
> rest too?

IMO, no.  not without some indication that the fault is well reproducible
and that the fault is actually theirs...

> And a single bad machine causes larger problems since it usually
> results in disrupting jobs that run spanning across a bunch of nodes
> too.

well, if you bought it as a cluster, not just some nodes,
then you might have a case that the cluster is not working.
the problem with replicability is that it permits fingerpointing.

> Just wanting to hear more about how I can best resolve this issue. For
> our future purchases would changing vendors help? Is there any trend

buying an extended warranty might help.  buying a shrink-wrapped cluster
might help too.

> behind the quality of services from different vendors? I have only
> been exposed to Dell and its frustrating customer-service so far; are
> HP / IBM or any others better or worse or uncorrelated? Of course, I

my organization has been an HP shop, more or less, since inception in 2001,
for reasons I won't go into.  I believe they've done well by us - I could 
criticize prices, some hardware design issues, etc, but they're quite 
responsible and responsive to problems.


