intermittent crashing of programs

Kris Thielemans kris.thielemans at csc.mrc.ac.uk
Thu Feb 21 06:52:41 PST 2002


(2nd resubmit after subscribing with a different email address...)


Hi,

We have a cluster of 4 dual Pentium III 600 MHz systems running SuSE Linux
7.1. On one of the PCs, our programs occasionally crash with a segmentation
fault. This also happens with an ordinary serial program that does all its I/O
on local disks. (It does use NIS to get user info, though, so I cannot easily
test it without the network.) The crash NEVER occurs on any of the other
systems.

At the time of the crash, I get the following messages in /var/log/messages:
-----------------------------------------------------------------
Feb 21 14:22:58 pp4 kernel: Uhhuh. NMI received. Dazed and confused, but trying to continue
Feb 21 14:22:58 pp4 kernel: You probably have a hardware problem with your RAM chips
-----------------------------------------------------------------
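
To check whether every crash lines up with one of these NMI messages, the
log could be scanned with something like the Python sketch below. It is only
an illustration: the function names (find_nmi_events, count_by_host) are made
up here, and the regular expression assumes the classic syslog layout seen in
the excerpt above ("Mon DD HH:MM:SS host kernel: message").

```python
import re
from collections import Counter

# Assumed layout of a classic syslog kernel line:
#   "Feb 21 14:22:58 pp4 kernel: <message>"
SYSLOG_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s"
    r"(?P<host>\S+)\skernel:\s(?P<msg>.*)$"
)


def find_nmi_events(lines):
    """Return a list of (timestamp, host) for kernel lines reporting an NMI."""
    events = []
    for line in lines:
        match = SYSLOG_RE.match(line.rstrip("\n"))
        # Only kernel lines match the pattern; keep those mentioning the NMI.
        if match and "NMI received" in match.group("msg"):
            events.append((match.group("stamp"), match.group("host")))
    return events


def count_by_host(events):
    """Count NMI events per host name (hypothetical helper)."""
    return Counter(host for _stamp, host in events)
```

Passing an open file works as well as a list of lines, e.g.
find_nmi_events(open("/var/log/messages")). The resulting timestamps can then
be compared by hand with the times at which the segmentation faults occurred.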

So, we ran memtest86-2.5 for 4 days continuously. No error was reported.

Any suggestions on how we can figure out what the problem is (aside from
replacing all memory chips)? Is it necessarily the RAM, or could it be
something else, e.g. the hard disk controller?

Thanks,

Kris Thielemans
(kris.thielemans <at> ic.ac.uk)
Imaging Research Solutions Ltd
Cyclotron Building
Hammersmith Hospital
Du Cane Road
London W12 0NN, United Kingdom

web site address: http://www.irsl.org/~kris




More information about the Beowulf mailing list