intermittent crashing of programs
becker at scyld.com
Thu Feb 21 08:43:01 PST 2002
On Thu, 21 Feb 2002, Kris Thielemans wrote:
> (2nd resubmit after subscribing with a different email address...)
OK, I just deleted them from the moderation-hold queue.
I usually approve held posts in a few hours during the week. The volume
of attempted spam has become very high in the past few months, so I'm
unlikely to loosen the requirement that non-member messages be held for
> we have a cluster of 4 dual Pentium III 600 MHz systems, running SuSE Linux
> 7.1. On one of the PCs, our programs occasionally crash with a segmentation
> fault. This also happens with an ordinary serial program with all its IO to
> local disks. (It does use NIS to get user info though, so I cannot easily
> test it without network). The crash NEVER occurs on any of the other
This is pretty clearly a hardware problem. Luckily you have other
similar systems to compare against.
> Feb 21 14:22:58 pp4 kernel: Uhhuh. NMI received. Dazed and confused, but
> trying to continue
> Feb 21 14:22:58 pp4 kernel: You probably have a hardware problem with your
> RAM chips
Hmmm, there is a similar problem reported in the eepro100 list on a Dell
4400 server. There the problem occurs when a PCI device is accessed
(and of course the driver is blamed). I'm guessing that problem
is a datapath parity error, which is slightly different than a PCI
You might want to read that thread which starts 16 Feb 2002.
The important detail to remember is that NMI is once again being used to
report system data errors, and there are additional error sources beyond
memory parity errors.
> So, we ran memtest86-2.5 for 4 days continuously. No error was reported.
I would swap RAM between two systems and see if the problem follows. If
the problem just goes away, you should still relegate the suspect RAM to
a machine that doesn't need to be reliable.
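
To judge whether the problem follows the RAM, it helps to count the NMI
reports per host before and after the swap. What follows is only a
minimal sketch, assuming syslog lines in the same format as the ones
quoted above. The function name count_nmi_by_host and the regular
expression are my own illustration, not part of any existing tool.

```python
import re
from collections import Counter

# Assumed line format, taken from the log excerpt quoted above:
#   "Feb 21 14:22:58 pp4 kernel: Uhhuh. NMI received. Dazed and confused, ..."
# Group 1 captures the host name that precedes "kernel:".
NMI_LINE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+(\S+)\s+kernel:.*NMI received"
)


def count_nmi_by_host(lines):
    """Return a Counter mapping host name -> number of NMI reports seen."""
    counts = Counter()
    for line in lines:
        match = NMI_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

One could feed it the lines of each node's system log (on many systems
that is /var/log/messages, but the path is an assumption) and compare the
per-host counts from before and after the RAM swap.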
Donald Becker becker at scyld.com
Scyld Computing Corporation http://www.scyld.com
410 Severn Ave. Suite 210 Second Generation Beowulf Clusters
Annapolis MD 21403 410-990-9993