[Beowulf] Geriatric computer does not stay up

Eric Thibodeau kyron at neuralbs.com
Mon Dec 21 11:05:45 PST 2009


This smells like the hell I went through when one of the CPUs needed to be changed in our department's Tyan VX50... Try swapping CPUs if you have spares.

ET
On 2009-12-16, at 5:36 PM, Jack Carrozzo wrote:

> I assume you've done this but forgot to mention it in the email - did
> you test the RAM?
> 
> -Jack Carrozzo
> 
> On Wed, Dec 16, 2009 at 5:27 PM, David Mathog <mathog at caltech.edu> wrote:
>> So we have a cluster of Tyan S2466 nodes and one of them has failed in
>> an odd way. (Yes, these are very old, and they would be gone if we had a
>> replacement.)  On applying power the system boots normally and gets far
>> into the boot sequence, sometimes to the login prompt, then it locks up.
>>  If booted failsafe it will stay up for tens of minutes before locking.
>>  It locked once on "man smartctl" and once on "service network start".
>> However, on the next reboot, it didn't lock with another "man smartctl",
>> so it isn't like it hit a bad part of the disk and died.  Smartctl test
>> has not been run, but "smartctl -a /dev/hda" on the one disk shows it as
>> healthy with no blocks swapped out.  Power stays on when it locks, and
>> the display remains as it was just before the lock.  When it locks it
>> will not respond to either the keyboard or the network.  (The network
>> interface light still flashes.)  There is nothing in any of the logs to
>> indicate the nature of the problem.
>> 
>> The odd thing is that the system is remarkably stable in some ways.  For
>> instance, the PS tests good and heat isn't the issue: after running
>> sensors in a tight loop to a log file, waiting for it to lock up, then
>> looking at the log on the next failsafe boot, there was negligible
>> fluctuation in any of the voltages, fan speeds, or temperatures.  It
>> will happily sit for 30 minutes in the BIOS, or hours running memtest86
>> (without errors).  The motherboard battery is good, and the inside of
>> the case is very clean, with no dust visible at all.  I reset the BIOS,
>> but that didn't change anything.
>> 
>> Here are my current hypotheses for what's wrong with this beast:
>> 
>> 1. The drive is failing electrically, puts voltage spikes out on some
>> operations, and these crash the system.
>> 2. The motherboard capacitors are failing and letting too much noise in.
>>  The noise which is fatal is only seen on an active system, so sitting
>> in the BIOS or in Memtest86 does not do it. (But the caps all look good,
>> no swelling, no leaks.)  It will run memtest86 overnight though, just in
>> case.
>> 3. The PS capacitors are failing, so that when loaded there is enough
>> voltage fluctuation to crash the system.  (Does not agree very well with
>> the sensors measurements, but it could be really high frequency noise
>> superimposed on a steady base voltage.)
>> 4. Evil Djinn ;-(
>> 
>> Any thoughts on what else this might be?
>> 
>> Thanks.
>> 
>> David Mathog
>> mathog at caltech.edu
>> Manager, Sequence Analysis Facility, Biology Division, Caltech
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>> 
> 
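
The quoted message mentions running sensors in a tight loop to a log file and reading the log after the next failsafe boot. A minimal sketch of one way to do that follows. It assumes the lm-sensors `sensors` command is on the PATH; the function names, the log file name, and the one-second interval are all made up for illustration and are not taken from the original setup. Each record is flushed and fsync'd so that the last reading before a lockup has a better chance of reaching the disk.

```python
import os
import subprocess
import time
from datetime import datetime


def read_sensors():
    """Return the text output of the lm-sensors `sensors` command (assumed installed)."""
    result = subprocess.run(
        ["sensors"], capture_output=True, text=True, check=False
    )
    return result.stdout


def log_readings(path, reader=read_sensors, interval=1.0, count=None):
    """Append timestamped readings to `path`, syncing after each record.

    count=None loops until interrupted (or until the machine locks up).
    Returns the number of records written.
    """
    written = 0
    with open(path, "a") as log:
        while count is None or written < count:
            # Header line marks when this reading was taken.
            log.write(f"--- {datetime.now().isoformat()} ---\n")
            log.write(reader().rstrip("\n") + "\n")
            log.flush()             # push Python's buffer to the OS
            os.fsync(log.fileno())  # ask the OS to commit it to disk
            written += 1
            if count is None or written < count:
                time.sleep(interval)
    return written
```

Calling `log_readings("sensors.log")` would loop until the machine stops. Passing a different `reader` lets the same loop be tried on a box without lm-sensors. Note that a one-second sample cannot show the high-frequency noise raised in hypothesis 3; that would need a scope on the supply rails.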



