intermittent crashing of programs

Donald Becker becker at scyld.com
Thu Feb 21 08:43:01 PST 2002


On Thu, 21 Feb 2002, Kris Thielemans wrote:

> (2nd resubmit after subscribing with a different email address...)

OK, I just deleted the earlier copies from the moderation-hold queue.
I usually approve held posts within a few hours during the week.  The volume
of attempted spam has become very high in the past few months, so I'm
unlikely to loosen the requirement that non-member messages be held for
moderation.


> we have a cluster of 4 dual Pentium III 600 MHz systems, running SuSE Linux
> 7.1. On one of the PCs, our programs occasionally crash with a segmentation
> fault. This also happens with an ordinary serial program with all its IO to
> local disks. (It does use NIS to get user info though, so I cannot easily
> test it without network). The crash NEVER occurs on any of the other
> systems.

This is pretty clearly a hardware problem.  Luckily you have other
similar systems to compare against.

> Feb 21 14:22:58 pp4 kernel: Uhhuh. NMI received. Dazed and confused, but
> trying to continue
> Feb 21 14:22:58 pp4 kernel: You probably have a hardware problem with your
> RAM chips

Hmmm, there is a similar problem reported in the eepro100 list on a Dell
4400 server.  There the problem occurs when a PCI device is accessed
(and of course the driver is blamed).  I'm guessing that problem
is a datapath parity error, which is slightly different from a PCI
parity error.

You might want to read that thread, which starts on 16 Feb 2002.
   http://www.scyld.com/pipermail/eepro100/2002-February/

The important detail to remember is that NMI is once again being used to
report system data errors, and that there are additional error sources
beyond memory parity errors.

> So, we ran memtest86-2.5 for 4 days continuously. No error was reported.

I would swap RAM between two systems and see if the problem follows.  If
the problem just goes away, you should still relegate the suspect RAM to
a machine that doesn't need to be reliable.
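
To tell whether the problem follows the RAM, it helps to tally which host
logs the NMI message before and after the swap.  Here is a minimal sketch,
assuming the syslog lines look like the ones quoted above ("Mon DD HH:MM:SS
host kernel: ... NMI received ...").  The function name count_nmi_by_host
and the idea of passing log file paths on the command line are my own
illustration, not an existing tool.

```python
import re
import sys
from collections import Counter

# Matches a classic syslog line such as:
#   Feb 21 14:22:58 pp4 kernel: Uhhuh. NMI received. Dazed and confused, ...
# The format is an assumption based on the log excerpt quoted in this thread.
NMI_LINE = re.compile(
    r"^\w{3}\s+\d+\s+\d\d:\d\d:\d\d\s+(\S+)\s+kernel:.*NMI received"
)


def count_nmi_by_host(lines):
    """Return a Counter mapping hostname -> number of NMI reports seen."""
    counts = Counter()
    for line in lines:
        match = NMI_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical usage: python nmi_count.py <logfile> [<logfile> ...]
    for path in sys.argv[1:]:
        with open(path, errors="replace") as log:
            for host, n in sorted(count_nmi_by_host(log).items()):
                print(f"{path}: {host}: {n} NMI report(s)")
```

Run it over the logs collected from each node before and after the swap.
If the count moves to the machine that received the suspect RAM, the RAM
is the likely culprit.  If it stays with the original machine, look at
the other error sources mentioned above.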


Donald Becker				becker at scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993



