[Beowulf] Tyan 2466 crashes, no obvious reason why

Robert G. Brown rgb at phy.duke.edu
Mon Sep 6 07:45:16 PDT 2004


On Sun, 5 Sep 2004, David Mathog wrote:

> After a few more crashes with nothing in the log files a shell
> script was run that logged all sensors readings every 10 seconds
> to a file.  When it next crashed (6 hours after a restart) there
> was no significant difference between any of the numbers, be
> they voltage, RPM, or Temp.  
> 
> I would have expected that if the power supply or on board
> voltage regulator was flaking out it would most likely result
> in noise showing up in sensors - but it didn't.
> 
> This time I also left a monitor plugged into the node
> and was greeted by this message on the down machine:
> 
> CPU 0:  Machine Check Exception: 000000000000004
> Bank 0: e67aa00000000833 at 000000003f9c8688
> Bank 1: f600200000000853 at 00000000001ab948
> 
> Kernel panic CPU context corrupt
> In interrupt handler - not syncing
> 
> That message must be new though, because when I plugged in
> that monitor the system had recently crashed, and there
> was nothing on the screen then.
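
As an aside, the two bank status words in that console message can be
picked apart by hand.  What follows is a minimal sketch, assuming the
generic x86 machine-check status layout (flag bits 63 down to 57, MCA
error code in the low 16 bits); decode_mci_status is a hypothetical
helper name, not part of any existing tool, and the layout should be
checked against the processor documentation for the Athlon MP before
anything is concluded from it.

```python
# Sketch only: bit positions follow the generic x86 MCi_STATUS layout.
# decode_mci_status is a hypothetical helper, not an existing tool.
FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # MISC register holds valid data
    58: "ADDRV",  # ADDR register holds a valid address
    57: "PCC",    # processor context corrupt
}

def decode_mci_status(status):
    """Split a 64-bit bank status word into flag names and the MCA error code."""
    set_flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
                 if (status >> bit) & 1]
    return {"flags": set_flags, "mca_error_code": status & 0xFFFF}

# The two status words from the console message quoted above.
for bank, raw in (("Bank 0", "e67aa00000000833"), ("Bank 1", "f600200000000853")):
    info = decode_mci_status(int(raw, 16))
    print(bank, info["flags"], hex(info["mca_error_code"]))
```

Under that assumed layout both banks report UC and PCC, which is
consistent with the "CPU context corrupt" panic, and both have ADDRV
set, so the value printed after "at" is the address the bank recorded.
Whether that address points at a particular DIMM is not something this
sketch can tell you.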

FWIW, we have had a pretty miserable experience overall with the entire
246x line from Tyan.  The 2460 was openly broken; the 2466 works OK but
is damned finicky and breaks easily.

IIRC, the 2466's come with a three year warranty from Tyan and the
processors typically are also warranted by AMD (depending a bit on
where/how you got the systems).  The original CPU fans distributed by
AMD with the CPUs totally suck.  We have had a tremendous failure rate
with them -- literally a box or so to send back and/or replace
ourselves, maybe 25-35% of our total cluster.  We have found that a
dying fan is a common source of trouble in a cluster node.  If a node
comes up, stays up for a while, and then crashes, chances are good that
the problem is load/heat related: once the load reaches a critical
point, a slightly dying fan can no longer keep one or the other CPU cool
enough, the CPU destabilizes, and the system crashes.
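
One way to look for that load/heat pattern is to log the load average
next to the sensor readings and force every record to disk, so the last
entry before a crash survives.  This is a rough sketch, assuming the
lm-sensors "sensors" command is installed and that its numbers are
worth anything on the board in question; snapshot and log_forever are
hypothetical names and the log path is up to you.

```python
import os
import subprocess
import time

def snapshot(cmd=("sensors",)):
    """Return one log record: timestamp, 1-minute load average, raw sensor text."""
    try:
        out = subprocess.run(list(cmd), capture_output=True, text=True,
                             timeout=5).stdout
    except (OSError, subprocess.TimeoutExpired) as exc:
        out = "sensor command failed: %s\n" % exc
    load1 = os.getloadavg()[0]  # 1-minute load average (Unix only)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    return "%s load1=%.2f\n%s" % (stamp, load1, out)

def log_forever(path, interval=10, cmd=("sensors",)):
    """Append a record every `interval` seconds, syncing each one to disk."""
    with open(path, "a") as fh:
        while True:
            fh.write(snapshot(cmd))
            fh.flush()
            os.fsync(fh.fileno())  # push the record out before the next crash
            time.sleep(interval)

# Example (runs until killed): log_forever("/var/tmp/node-sensors.log")
```

If the last few records before each crash all show high load, that
points toward cooling or power; if crashes also happen at idle, look
elsewhere.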

Sometimes the fans die all the way.  Sometimes the underlying CPUs then
cook (we also have a smaller pile of cooked CPUs).  Sometimes the power
supplies themselves smoke.  Sometimes the smoking PS's take other system
components with them (or rather, it may be that a smoking motherboard is
shorted internally enough to take the PS with it).  We have had pretty
low reliability overall in these systems, to the point where we have
only RARELY had our entire 2466 cluster up and running perfectly.  This
is in sharp contrast to e.g. the Opteron cluster(s), which have
functioned perfectly since powerup.  It isn't about AMD (except for the
fan issue, which they have owned up to; they are willingly replacing any
fans that we have trouble with).  It is to some extent about Tyan -- it
would take some effort to convince us at this point that Tyan's
motherboards are built with great quality control.  The 2460 was so bad
that they should have just swapped it for 2466's across the board for
free.

> The motherboard capacitors have all been visually inspected
> and none of them are leaking, bulging, or otherwise showing
> signs of failure.
> 
> memtest86 is running now (and for the next 36 hours or so) but
> if it doesn't find anything, does the console error suggest
> a region of memory to test more intensively, or a particular test
> to run in memtest86???
> 
> Looks like I'm going to need a bunch of spare parts for a "fun"
> game of "swap components and wait for the crash"...

That is exactly what we do.  Only rarely do we get a clean signal of
failure, and we have 2-3 systems running at any time that crash
intermittently just as you describe.  We know that it is hardware only
because all the systems are identical and running identical tasks, and
the failures tend to appear in particular systems and then persist in
those systems until eventually they crash all the way.
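
The "failures stick to particular systems" argument can be checked
against a crash log.  A minimal sketch, assuming you keep a list with
one node name per crash; the node names, the suspect_nodes helper, and
the threshold factor are all made up for illustration.

```python
from collections import Counter

def suspect_nodes(crash_log, n_nodes, factor=3.0):
    """Flag nodes that crash far more often than the cluster-wide average.

    crash_log: iterable of node names, one entry per observed crash.
    n_nodes:   total number of (identical) nodes in the cluster.
    A node is flagged if it crashed at least twice and more than
    `factor` times the mean number of crashes per node.
    """
    counts = Counter(crash_log)
    mean = sum(counts.values()) / n_nodes
    return sorted(node for node, c in counts.items()
                  if c >= 2 and c > factor * mean)

# Hypothetical log for a 16-node cluster.
crashes = ["n07", "n07", "n12", "n07", "n03", "n12", "n07"]
print(suspect_nodes(crashes, n_nodes=16))  # ['n07', 'n12']
```

If the crashes were caused by software, identical nodes running
identical tasks should crash at roughly equal rates; a few nodes
standing far above the mean suggests hardware in those nodes.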

In order, for an intermittent crash we suspect:

  a) The CPU fans.  Knee jerk replace if there is ANY wobble, noise,
visible difference in speed (don't trust sensors output for fan speed or
temperature).

  b) The case fans.  Athlons are notoriously sensitive to heat, and
obstructions in case airflow, insufficient flow (fans too small), or hot
components between the intake and the CPUs can all cause case
overheating and destabilize memory or possibly other components on the
motherboard (who knows?).

  c) The power supply.  Just because it is one of the most common parts
to fail, although fortunately they tend to fail all the way (often with
smoke) or not at all.

  d) Roughly equally, motherboard, memory, other components, gremlins,
CPU.  Again motherboards USUALLY fail catastrophically (eventually) but
a few systems' flakiness has followed the motherboard and then the
motherboard has died the rest of the way.  Memory failure is not
uncommon, but memtest86 or at worst the tried and true swap-the-DIMM
game usually finds the culprit, eventually.  "Other components" is
actually pretty rare but you gotta look.  CPUs definitely fail (often
associated with failure of fan(s), motherboard, PS) and sometimes they
fail slowly -- flaking out under load or intermittently.  Swapping the
CPUs around has proven that some CPUs are perfectly capable of booting a
system as CPU0, running for a day, and then exhibiting a fault.  If a
CPU has EVER been overheated this is not even that unlikely.  Since we
have had a lot more fan failures than CPU failures, we have plenty of
CPUs at risk.

Basically, we're just trying to keep our 2466's going until new grant
money buys replacement nodes and we can sanely retire them.  They are
all less than 3 years old, though, and we really need them to run one to
two more years.  We >>have<< gotten lots of work done on them, and they
>>do<< run well a lot of the time -- just not a terribly stable design
and SO sensitive to heat and power problems...


Hope this helps,

  rgb (with WAY more experience fixing 2466's than he ever hoped to
accrue).



> 
> 
> Thanks,
> 
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
> 

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu





