[Beowulf] Repeated Dell SC1435 crash / hang. How to get the vendor to resolve the issue when 20% of the servers fail in first year?
Nifty Tom Mitchell
niftyompi at niftyegg.com
Mon Apr 6 19:08:12 PDT 2009
On Mon, Apr 06, 2009 at 06:18:05PM -0500, Rahul Nabar wrote:
> On Mon, Apr 6, 2009 at 2:37 PM, John Bushnell
> <john.bushnell at icb.ucsb.edu> wrote:
> > I never call support until after I have diagnosed a problem myself as much
> > as possible. One of the advantages of buying a batch of nodes at once is
> > that you can easily swap components between nodes to isolate the real
> > problem. You will find Dell support easier to deal with (or any other
> > vendor for that matter) if you can concisely tell them all of the steps that
> > you took to determine that component X needs replacement. Yes, I have had a
> > bad vendor give me the runaround, but the better the information you can
> > put into an initial service call, the better the service you will tend to
> > receive. If you just say "my node stopped working", they will assume that
> > you don't know what you're doing.
> What puzzles me is this:
> Someone had to write the code that produces the error that my
> baseboard controller logs:
> "Critical error; Voltage sensor (VCORE) critical error. State asserted
> CPU2" etc.
> In all my naivete I'd expect it to be a branch responding to some
> error condition. Why is it so hard for the vendor to at least
> single out which chip or component that error was designed to flag?
> One could argue "many conditions can result in this specific error",
> but then again, what's the point of a trap so generic?
> I wish I could pore over the source myself just for kicks. Or if
> somehow I could get access to the guy who coded the firmware on that
There is something important here. The BMC is reporting a hardware
error. This is not a Linux/RHEL/SciLinux/Fedora/anything code path.
This is Dell hardware, pure and simple.
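
For the "diagnose before you call" advice quoted above, here is a rough
sketch of how the BMC messages could be tallied per node before opening a
ticket. It assumes the event-log text has already been collected from each
node by whatever means is available. The node names are hypothetical, and
the pattern is modeled only on the one message quoted in this thread, so
the real log format may well differ.

```python
import re
from collections import Counter

# Assumption: the pattern is modeled on the single message quoted in this
# thread. Real BMC event logs may be worded or laid out differently.
PATTERN = re.compile(
    r"Voltage sensor \((?P<sensor>\w+)\) critical error.*?(?P<cpu>CPU\d+)"
)


def tally_voltage_errors(logs):
    """Count voltage-sensor errors per (node, sensor, cpu).

    logs: dict mapping a node name to a list of event-log lines.
    """
    counts = Counter()
    for node, lines in logs.items():
        for line in lines:
            match = PATTERN.search(line)
            if match:
                key = (node, match.group("sensor"), match.group("cpu"))
                counts[key] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical node names; the message text is the one quoted above.
    sample = {
        "node01": [
            "Critical error; Voltage sensor (VCORE) critical error. "
            "State asserted CPU2",
        ],
        "node02": [],
    }
    for (node, sensor, cpu), n in sorted(tally_voltage_errors(sample).items()):
        print(f"{node}: {sensor} on {cpu}: {n} event(s)")
```

Running the same tally before and after swapping a part between two nodes
shows whether the error follows the part or stays with the chassis, which
is the kind of concise evidence a support call benefits from.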
T o m M i t c h e l l
Found me a new hat, now what?