[Beowulf] Repeated Dell SC1435 crash / hang. How to get the vendor to resolve the issue when 20% of the servers fail in first year?
prentice at ias.edu
Mon Apr 6 12:48:48 PDT 2009
Rahul Nabar wrote:
> On Mon, Apr 6, 2009 at 1:32 PM, Frank Gruellich
> <frank.gruellich at navteq.com> wrote:
>> IMHO SC1435 are some kind of low-cost metal from DELL. I would not use
>> them if I want a reliable system. Especially in HPC where one failed
>> systems ruins your whole (maybe long running) job.
> Thanks for the comments Frank. I did not realize that the SC1435
> wasn't suitable for HPC. I know it is one of the lower end systems
> without RemoteManagement nor hot-swappable-hardware etc. (but we don't
> really need the frills) but I was under the impression that this model
> is fairly common in other HPC installations. Maybe we were wrong, in
Actually, I believe the SC1435 are designed for HPC and clustering. Our
cluster from Dell was spec'ed with them. I thought the SC stood for
"Scientific Computing" or something like that, not "Super Cheap". With
no redundancy, SC1435s are definitely less robust hardware than normal
PowerEdge servers, but I thought that was the point - you don't need that
in a cluster node since the nodes themselves provide redundancy.
And they do have RemoteManagement. Mine all have IPMI 2.0-compliant BMC
cards in them. I've been powering mine up and down remotely for months now.
One other problem I've been having with mine: The thumbscrews that hold
them into the racks are made of aluminum, which is not a good material for
screws. I've already cross-threaded several with minimal effort, and
once they're cross-threaded, there's almost no way to get them out. I
used Craftsman screw-outs, but because the aluminum was so soft, I just
ended up shaving the heads off. More brutish methods just resulted in
the heads snapping right off. Why anyone decided to use aluminum for
these screws instead of steel boggles my mind.