[Beowulf] Repeated Dell SC1435 crash / hang. How to get the vendor to resolve the issue when 20% of the servers fail in first year?
rpnabar at gmail.com
Mon Apr 6 12:35:10 PDT 2009
On Mon, Apr 6, 2009 at 1:32 PM, Frank Gruellich
<frank.gruellich at navteq.com> wrote:
> IMHO SC1435 are some kind of low-cost metal from DELL. I would not use
> them if I want a reliable system. Especially in HPC where one failed
> system ruins your whole (maybe long-running) job.
Thanks for the comments, Frank. I did not realize that the SC1435
wasn't suitable for HPC. I know it is one of the lower-end systems,
without remote management, hot-swappable hardware, etc. (we don't
really need the frills), but I was under the impression that this model
is fairly common in other HPC installations. Maybe we were wrong, in
> The DELL support is a bit tricky. We have Silver or Gold support for
> most systems, I don't know how they work for lower levels. I can't
> complain about Gold. For Silver they always try to make us do stuff
> like cross-testing memory, CPU or other things. (The most interesting
> request is to do a BIOS update to cure an (obviously) memory problem.
> The machine ran fine for 2 years with the old BIOS -- memory combination
> and suddenly it complains about it?) While I really like to play such
> hardware games, I just don't have the time for it. If you keep refusing
> these requests, eventually they give up and send a technician to replace
> different pieces of hardware.
I ought to check whether we are "Gold", "Silver", or neither. Yes, the BIOS
update gig I am familiar with. I can almost quote their debug checklist
from memory. They made me check and update BIOSes too. It was
funny, especially since it had not even been a month since we bought
them, but the tech insisted our BIOS was *not* up to date. We
fixed it, but I still wonder why they do not just ship machines with
up-to-date BIOS versions!
> We use CentOS for most installations and DELL support never complained
> about it. And IMHO the OS should not be able to cause an error detected by
> the management board.
Exactly my opinion. It seems clearly a hardware-level fault and the
OS angle seems mostly smoke and mirrors to me. If it were a simple
software crash, I cannot explain why the system would not reboot when
the reboot button is pressed.
> I have dset reports in place, before calling support, because they
> always request them. That speeds up chit-chat a bit.
Yes, dset and sosreports seem standard requests.
> That's another problem: IMHO your university should have a dedicated guy
> taking care of the computer systems, someone who has the time to deal with
> DELL support and so on. 23 machines don't make a full-time job, but
> maybe someone who's already taking care of some other Linux installation
> could do it. It's not a good idea to have just some grad student doing that
> job part-time (no offense). I know that reality looks bad.
Ah well, one does what one needs to! :) These are dedicated research
machines for our computational chemistry group so they will be running
code that eventually (hopefully!) puts results into my PhD thesis! :)
Most parts of system administration are fun except maybe having to
deal with stubborn vendors!