intermittent crashing of programs

Donald Becker becker at scyld.com
Thu Feb 21 08:43:01 PST 2002


On Thu, 21 Feb 2002, Kris Thielemans wrote:

> (2nd resubmit after subscribing with a different email address...)

OK, I just deleted them from the moderation-hold queue.
I usually approve held posts in a few hours during the week.  The volume
of attempted spam has become very high in the past few months, so I'm
unlikely to loosen the requirement that non-member messages be held for
moderation.


> we have a cluster of 4 dual Pentium III 600 MHz systems, running SuSE Linux
> 7.1. On one of the PCs, our programs occasionally crash with a segmentation
> fault. This also happens with an ordinary serial program with all its IO to
> local disks. (It does use NIS to get user info though, so I cannot easily
> test it without network). The crash NEVER occurs on any of the other
> systems.

This is pretty clearly a hardware problem.  Luckily you have other
similar systems to compare against.

> Feb 21 14:22:58 pp4 kernel: Uhhuh. NMI received. Dazed and confused, but
> trying to continue
> Feb 21 14:22:58 pp4 kernel: You probably have a hardware problem with your
> RAM chips

Hmmm, there is a similar problem reported in the eepro100 list on a Dell
4400 server.  There the problem occurs when a PCI device is accessed
(and of course the driver is blamed).  I'm guessing that problem
is a datapath parity error, which is slightly different from a PCI
parity error.

You might want to read that thread which starts 16 Feb 2002.
   http://www.scyld.com/pipermail/eepro100/2002-February/

The important detail to remember is that NMI is once again being used to
report system data errors, and that there are additional error sources
beyond memory parity errors.

> So, we ran memtest86-2.5 for 4 days continuously. No error was reported.

I would swap RAM between two systems and see if the problem follows the
RAM.  If the problem just goes away, you should still relegate the
suspect RAM to a machine that doesn't need to be reliable.
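
One way to check whether the NMI reports follow the RAM after a swap is
to count them per host in the kernel logs.  The following is only a
minimal Python sketch.  It assumes traditional syslog lines of the form
quoted above ("Mon DD HH:MM:SS host kernel: ..."); the helper name
count_nmi_by_host and the log path in the usage comment are hypothetical
and should be adjusted to the local setup.

```python
import re
import sys
from collections import Counter

# Assumed layout: "Feb 21 14:22:58 pp4 kernel: Uhhuh. NMI received. ..."
# Group 1 captures the hostname field that precedes "kernel:".
NMI_LINE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+(\S+)\s+kernel:.*NMI received"
)


def count_nmi_by_host(lines):
    """Return a Counter mapping hostname to number of NMI reports seen."""
    counts = Counter()
    for line in lines:
        match = NMI_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical usage: python count_nmi.py /var/log/messages [more logs]
    totals = Counter()
    for path in sys.argv[1:]:
        with open(path, errors="replace") as log:
            totals += count_nmi_by_host(log)
    for host, n in totals.most_common():
        print(f"{host}: {n} NMI report(s)")
```

Running it over the logs collected before and after the swap shows
whether the reports stay with the original machine or move with the
RAM.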


Donald Becker				becker at scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993
