[Beowulf] Repeated Dell SC1435 crash / hang. How to get the vendor to resolve the issue when 20% of the servers fail in first year?
rpnabar at gmail.com
Mon Apr 6 00:54:23 PDT 2009
We bought 23 Dell PowerEdge SC1435 servers for our latest cluster
addition in mid-2008. This batch of machines has proved to be a total
disaster from day one. I am looking for suggestions on how I should
tackle this. We are a fairly small university setup and I don't have
much experience dealing with these vendor issues.
I put these machines into production in Aug '08. Within a month we had
the first machine go bad. They hang with an amber LED and the
logging-module clearly logs an error of the sort: "Voltage sensor
(VCORE) critical error. State asserted CPU2". The machine needs a
physical power-cycle from the back-plane to restart.
I contact Dell. Responses range from the clueless to the absurd. First,
they convinced us it was Fedora. So I shifted to CentOS. They still
claim CentOS is "unvalidated" but I refuse to spend a fortune to move
over to RHEL like they want me to. I doubt this has anything to do
with our problem anyways. I discussed this problem extensively on the
Beowulf group back then and got many excellent suggestions, thanks!
Then I go through the whole circus running dset, ipmi, sosreport and a
bunch of stress-testing tools they sent me. It all takes a lot of
time. Eventually they send me swaps for the Motherboard and CPU. No
go. Still hangs at random.
From Sept. 2008 till Jan 2009 I had a total of 5 servers go bad. 5
out of 23 is close to 20% failure rate. Finally they agree to swap a
few servers in their entirety and this solved the problem for those
specific machines. I suspect they have a bad batch of SC1435's
but they say they do not have any other reports.
Now I have a new machine go down and it's back to wasting my time
going all over those debugging procedures.
Do others face similar vendor issues? If 6 out of 23 machines go bad
within 8 months of an order can I expect the vendor to exchange the
rest too? Or do I have to wait for each machine to individually go
down? In spite of having paid for next-day service each time we have
waited more than a month while Dell goes through all the debugging
circus. The last straw was a Dell-tech-rep who chastised me today:
"The next day (service) refers to normal break fix issues that
involve normal parts, since we may be replacing an entire server it
may take longer".
To quantify: "longer" usually means a month+ for us.
And a single bad machine causes larger problems, since it usually
disrupts jobs that span a bunch of nodes too. I'm just a grad student
here, and without a dedicated sys-admin it takes a lot of time to run
all the tests that Dell demands. I am OK running basic tests, but if
machines go bad during the warranty, is such testing within my domain
or the vendor's?
I haven't really pored over all the legalese in our contracts but is
there a "lemon-law" analog for computers? If 20% of the machines are
bad in the first year do you think I can press for a better
resolution from Dell?
Just wanting to hear more about how I can best resolve this issue. For
our future purchases would changing vendors help? Is there any trend
behind the quality of services from different vendors? I have only
been exposed to Dell and its frustrating customer-service so far; are
HP / IBM or any others better or worse or uncorrelated? Of course, I
do realize that ours is indeed a small setup by today's standards (we
just have 23 SC-1435's) so I am not really one of Dell's high-revenue
customers.
Or is this just the way things are, and I ought to resign myself to it
rather than fight it out?