[Beowulf] Repeated Dell SC1435 crash / hang. How to get the vendor to resolve the issue when 20% of the servers fail in first year?
hahn at mcmaster.ca
Mon Apr 6 10:37:11 PDT 2009
> I put these machines into production in Aug '08. Within a month we had
> the first machine go bad. They hang with an amber LED and the
what's the term of the warranty?
> logging-module clearly logs an error of the sort: "Voltage sensor
> (VCORE) critical error. State asserted CPU2". Machine needs a
> power-cycle physically from back-plane to restart
well, I think it's worth asking whether you're sure your power feed
is in good shape.
> Do others face similar vendor issues? If 6 out of 23 machines go bad
> within 8 months of an order can I expect the vendor to exchange the
> rest too?
IMO, no. not without some indication that the fault is readily reproducible
and that the fault is actually theirs...
> And a single bad machine causes larger problems since it usually
> results in disrupting jobs that run spanning across a bunch of nodes
well, if you bought it as a cluster, not just some nodes,
then you might have a case that the cluster is not working.
the problem with replicability is that it permits fingerpointing.
> Just wanting to hear more about how I can best resolve this issue. For
> our future purchases would changing vendors help? Is there any trend
buying an extended warranty might help. buying a shrink-wrapped cluster
might help too.
> behind the quality of services from different vendors? I have only
> been exposed to Dell and its frustrating customer-service so far; are
> HP / IBM or any others better or worse or uncorrelated? Of course, I
my organization has been an HP shop, more or less, since inception in 2001,
for reasons I won't go into. I believe they've done well by us - I could
criticize prices, some hardware design issues, etc, but they're quite
responsible and responsive to problems.