[Beowulf] Repeated Dell SC1435 crash / hang. How to get the vendor to resolve the issue when 20% of the servers fail in first year?

Rahul Nabar rpnabar at gmail.com
Tue Apr 7 17:28:53 PDT 2009


On Tue, Apr 7, 2009 at 5:12 PM, Matt Lawrence <matt at technoronin.com> wrote:
> See if there is a standalone diagnostics CD for these systems.  If you can
> get the error to occur with it, let Dell fix it from there.

Tried the CD but could not get the error to occur in a reasonable time
(ran it for about 3 days; maybe I ought to run it longer by taking an
entire node out of production). There are two issues: (1) even during
production the error has a statistical mean-time-between-crashes of
around 1 to 2 weeks, and (2) I am not really sure the diagnostic CD tests
all the relevant calls and CPU functions.
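
To put rough numbers on issue (1), here is a minimal sketch of how likely
a 3-day diagnostic run is to hit a crash at all. It assumes (and these are
assumptions, not measurements) that crashes arrive as a Poisson process,
that the 1 to 2 week figure applies to the single node under test, and
that the diagnostic CD triggers the fault at the same rate as the
production load. The function names are hypothetical.

```python
import math

def p_crash_within(test_days, mtbc_days):
    """Probability of at least one crash during test_days.

    Assumes exponentially distributed times between crashes with mean
    mtbc_days (an assumption; the real failure distribution is unknown).
    """
    return 1.0 - math.exp(-test_days / mtbc_days)

def days_needed(mtbc_days, confidence):
    """Test length in days needed to see at least one crash with the
    given probability, under the same exponential assumption."""
    return -mtbc_days * math.log(1.0 - confidence)

if __name__ == "__main__":
    for mtbc in (7, 14):  # the "1 to 2 weeks" range from the post
        p = p_crash_within(3, mtbc)
        d = days_needed(mtbc, 0.95)
        print(f"MTBC {mtbc:2d} days: P(crash in 3 days) = {p:.2f}, "
              f"days for 95% chance = {d:.0f}")
```

Under these assumptions a 3-day run would more likely than not finish
without a crash, so a clean run says little either way. If the diagnostic
CD does not exercise the failing code path at all (issue 2), no run length
would help.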

I wish there were an "I do X and the error occurs" case; that would be
simple. This is the class of non-repeatable, one-off errors that is hard
to demonstrate.

-- 
Rahul



