Problems with dual Athlons

Robert G. Brown rgb at phy.duke.edu
Wed Jul 31 08:49:06 PDT 2002


On Wed, 31 Jul 2002, Ray Schwamberger wrote:

> You might try the noapic option. I'm thinking there may be some kind of 
> issues with APIC, AMD and 2.4.18.

We don't have ASUS systems but instead a mix of Tyan 2460 and 2466
systems and see very similar things, including the bizarreness of the
blind crash problems appearing on one system (consistently and
repeatedly) but not another IDENTICAL system sitting right next to it.

We have found that power supplies (both the power line itself and the
switching power supply in the chassis) can make a difference on the
2466's -- a marginal power supply is an invitation to problems for sure
on these beasties.  This is reflected in the completely outrageous
observation that I have some nodes that will boot and run stably when
plugged into certain receptacles on the power pole, but not other
receptacles.  If I put a polarity/circuit tester on the receptacles,
they pass.  If I check the line voltages, they are nominal (120+ VAC).
If I plug any 2466 into them (I tried 3), it fails to POST.  If I move
the plug two receptacles up on the same pole and same circuit, it POSTS,
installs, and works fine.  I haven't put an oscilloscope on the line
when plugging it in, but I'm sure it would be fascinating to do so.

We're also in the process of investigating kernel snapshot dependencies
and the aforementioned SMP issues as we continue to try to stabilize our
2460's, which seem even more sensitive than the 2466's (which so far
seem to run stably and give decent performance overall).
Unfortunately, our crashes under load occur with a mean time between
them of days to a week or two (consistent with a rare interrupt conflict
or SMP issue), so it takes a long time to test a potential fix.  We did
avoid a crash for about 9 days on a 2460 running 2.4.18-5 (Red Hat's
build id) after experiencing crashes on the node every 5-10 days, but
are only just now accumulating better statistics on a group of nodes
instead of just the one.
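
To put rough numbers on why one quiet 9-day stretch proves little, here
is a minimal sketch.  It ASSUMES crashes arrive independently at a
constant rate (an exponential model, not something we have measured) and
takes 7.5 days, the midpoint of the 5-10 day range above, as the mean
time between crashes.  The function names are made up for illustration.

```python
import math

def survival_probability(days, mtbf_days, nodes=1):
    """Chance that none of `nodes` unfixed nodes crashes in `days`.

    Assumes independent, exponentially distributed times between
    crashes with mean `mtbf_days` (an assumption, not a measurement).
    """
    return math.exp(-nodes * days / mtbf_days)

def days_needed(mtbf_days, nodes=1, alpha=0.05):
    """Crash-free days needed on every node before luck alone
    (no real fix) would explain the quiet run with probability < alpha."""
    return -mtbf_days * math.log(alpha) / nodes

mtbf = 7.5  # assumed: midpoint of "every 5-10 days"

# One node, 9 days without a crash: how often would that happen anyway?
print(f"P(9 quiet days, 1 node, no fix) = {survival_probability(9, mtbf):.2f}")

# How long must we wait, on one node versus a group of eight?
print(f"days needed, 1 node : {days_needed(mtbf):.1f}")
print(f"days needed, 8 nodes: {days_needed(mtbf, nodes=8):.1f}")
```

Under those assumptions a single node goes 9 days without crashing
about 30% of the time even with no fix at all, which is why a group of
nodes gives an answer so much faster than just the one.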

So overall, I concur -- try different smp kernel releases and snapshots,
try rearranging the cards (order often seems to matter) and BIOS
settings, try noapic (which we should probably also do -- we haven't
so far) and yes, try rearranging the way the nodes are plugged in.
Notice that this is evil and insidious -- you can pull a node from a
rack and bench it and it will run fine forever, but if you plug it back
in to the same receptacle when you put it back, it has problems.
Maddening.

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu






More information about the Beowulf mailing list