[Beowulf] Repeated Dell SC1435 crash / hang. How to get the vendor to resolve the issue when 20% of the servers fail in first year?
frank.gruellich at navteq.com
Mon Apr 6 11:32:56 PDT 2009
Rahul Nabar wrote:
> We had bought 23 Dell-SC1435-PowerEdge servers for our latest cluster
> addition mid-2008.
IMHO the SC1435 is some kind of low-cost metal from DELL. I would not use
them if I wanted a reliable system, especially in HPC, where one failed
system ruins your whole (maybe long-running) job.
I found some SC1435 we bought before April 2008 and some before July
2008, but both deliveries had only minor problems after initial setup
(as usual). From time to time a PDU or memory goes bad. But we do not
have the load of HPC, so maybe our hardware has less stress. We had a
cluster of 5 machines (DELL PE1950) from one delivery where we had to
replace most of the PDUs (redundant, so no big deal). These machines
have a bit more load, though I don't believe there is a connection.
Sometimes DELL just has bad days.
> I contact Dell. Responses range from the clueless to absurd.
DELL support is a bit tricky. We have Silver or Gold support for
most systems; I don't know how they work at lower levels. I can't
complain about Gold. For Silver they always try to make us do stuff
like cross-testing memory, CPUs or other things. (The most interesting
request was to do a BIOS update to cure an (obvious) memory problem.
The machine ran fine for 2 years with the old BIOS and memory
combination and suddenly it complains about it?) While I really like to
play such hardware games, I just don't have the time for it. If you keep
refusing these requests, eventually they give up and send a technician
to replace different pieces of hardware.
> First, they convinced us it was Fedora. So I shifted to CentOS.
> They still claim CentOS is "unvalidated" but I refuse to spend a
> fortune to move over to RHEL like they want me to.
We use CentOS for most installations and DELL support never complained
about it. And IMHO the OS should not be able to cause an error detected
by the management board.
> Then I go through the whole circus running dset,
I have dset reports ready before calling support, because they
always request them. That speeds up the chit-chat a bit.
> Now I have a new machine go down and it's back to wasting my time
> going all over those debugging procedures.
IMHO that's not your job. I don't know about your setup, but if you
have some dozens more machines next to this particular set of
machines, all running without problems, I would blame DELL.
> In spite of having paid for next-day service each time we have
> waited more than a month while Dell goes through all the debugging
Keep refusing to do that. You don't have the time. They can do this
debugging after they have replaced your hardware, giving you back 100%
of what you paid for.
> "The next day (service) refers to normal break fix issues that
> involve normal parts, since we may be replacing an entire server it
> may take longer".
I would not accept that.
> I'm just a grad-student here and without a dedicated sys-admin it
> takes a lot of time running all these testing etc. that Dell demands;
> I am ok running basic tests but if machines go bad during the warranty
> is such testing within my domain or the vendors?
That's another problem: IMHO your university should have a dedicated guy
taking care of the computer systems, someone who has the time to deal
with DELL support and so on. 23 machines are not a full-time job, but
maybe there is someone who is already taking care of some other Linux
installation. It's not a good idea to have just some grad student doing
that job part-time (no offense). I know that reality looks bad.
Navteq (DE) GmbH
Map24 Systems and Networks
Duesseldorfer Strasse 40a
Phone: +49 6196 77756-414
Fax: +49 6196 77756-100
USt-ID-No.: DE 197947163
Managing Directors: Thomas Golob, Alexander Wiegand,
Hans Pieter Gieszen, Martin Robert Stockman