[Beowulf] Repeated Dell SC1435 crash / hang. How to get the vendor to resolve the issue when 20% of the servers fail in first year?

Mon Apr 6 16:04:49 PDT 2009

>> IMHO SC1435 are some kind of low-cost metal from DELL.  I would not use
...
> Thanks for the comments Frank. I did not realize that the SC1435
> wasn't suitable for HPC. I know it is one of the lower end systems
> without RemoteManagement nor hot-swappable-hardware etc. (but we don't

remote management is MUCH more of an HPC thing than hotswap.
whether you choose HS or not is a tradeoff involving expected failure rate,
how much a failure costs you (opportunity, lost work), and the expense of
the hotswap hardware.  our compute
nodes don't have anything hotswap, and we're happy with that choice.
(they have dual disks, which we intended to raid to cover possible disk
failures, but that has turned out to be overkill - one disk would
have been fine.)

remote management, otoh, really hinges on whether you have onsite gofers.
I wouldn't consider building even a small cluster without IPMI, given 
that the marginal cost is something like .5-5% per node.  not a frill...
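
to make the "no onsite gofers" case concrete, here is a minimal sketch of
remote power control through ipmitool.  it assumes the node has a BMC
reachable over the LAN (which, per this thread, a stock SC1435 may not).
the hostname, username and password-file path are hypothetical placeholders,
and the helper names are mine, not part of any library.

```python
import subprocess

# Power actions accepted by "ipmitool chassis power <action>".
ALLOWED_ACTIONS = {"status", "on", "off", "cycle", "reset", "soft"}


def ipmi_power_cmd(host, user, action="status",
                   password_file="/root/.ipmi_pass"):
    """Build the ipmitool argv for a chassis power action on one BMC.

    host, user and password_file are hypothetical values for illustration.
    The password is read from a file (-f) so it does not show up in `ps`.
    """
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported power action: {action}")
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user,
            "-f", password_file, "chassis", "power", action]


def run_power(host, user, action="status"):
    """Run the command and return the CompletedProcess (needs ipmitool)."""
    cmd = ipmi_power_cmd(host, user, action)
    return subprocess.run(cmd, capture_output=True, text=True, timeout=30)
```

something like run_power("node07-bmc", "admin", "cycle") would then
power-cycle a wedged node without anyone walking to the rack; "reset" is the
hard reset, "soft" asks the OS to shut down via ACPI.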

>> like cross testing memory, CPU or other things.  (The most interesting
>> request is to do a BIOS update to cure a (obviously) memory problem.

perfectly reasonable: the bios is responsible for probing the memory
and configuring the memory controller.

>> about it.  And IMHO the OS should be able to cause an error detected by
>> the management board.

uh, very idealistic.  ignores interesting things like OS-loaded CPU firmware,
ACPI, SMI, etc.

> OS angle seems mostly smoke-and-mirrors to me. I cannot explain why
> the system will not reboot by pressing the reboot button if it were a
> simple software crash.

most systems have a "soft" power button these days, which means that 
the OS can hook it and potentially screw up.  obviously a hardware reset
is different.  setting the power button to instant-off, afaik, disables
the acpi-related hooking.  also, magic sysrq may be useful in isolating
whether the kernel is alive (heck, toggling the shift-lock key/led is 
my first test for kernel liveness...)  that said, your problem does sound
like the responsibility of the vendor to me...
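
magic sysrq only helps if it was enabled before the hang, so it is worth
checking ahead of time.  a minimal sketch, assuming a Linux node where
/proc/sys/kernel/sysrq holds the setting (0 = off, 1 = everything, otherwise
a bitmask); the function names are mine, and the bit meanings follow the
kernel's sysrq documentation as I understand it.

```python
# Bitmask values for /proc/sys/kernel/sysrq (values > 1).
SYSRQ_BITS = {
    2: "console loglevel control",
    4: "keyboard control (SAK, unraw)",
    8: "debugging dumps of processes",
    16: "sync",
    32: "remount read-only",
    64: "signalling processes (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nicing of all RT tasks",
}


def decode_sysrq(value):
    """Return the list of sysrq function groups enabled by `value`."""
    if value == 0:
        return []                         # sysrq disabled entirely
    if value == 1:
        return list(SYSRQ_BITS.values())  # all functions enabled
    return [name for bit, name in SYSRQ_BITS.items() if value & bit]


def read_sysrq(path="/proc/sys/kernel/sysrq"):
    """Read the current setting from a running Linux node."""
    with open(path) as f:
        return int(f.read().strip())
```

on a healthy node, decode_sysrq(read_sysrq()) shows what the keyboard combos
will be allowed to do once the machine wedges.  if sync and reboot still work
from the console during a "hang", the kernel is alive and the problem is
higher up.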

>> DELL support and so on.  23 machines don't give a full time job, but
>> maybe someone who's taking care about some other Linux installation
>> already.  It's not a good idea to have just some grad-student doing that
>> job part-time (no offense).  I know that reality looks bad.

well, I'm part of a shared academic HPC consortium - in part our purpose
is precisely to avoid wasting prof/grad/researcher time on this kind of
thing...