intermittent crashing of programs
becker at scyld.com
Thu Feb 21 08:48:45 PST 2002
On Thu, 21 Feb 2002, Patrick Geoffray wrote:
> Kris Thielemans wrote:
> > Any suggestions on how we figure out what the problem is (aside from
> > replacing all memory chips)? Is it necessarily RAM, or could it be e.g. the
> > hard disk controller or so?
> It's usually RAM, but it can also be a PCI device whining. I have seen
> NMIs from SCSI boards when they were waiting too long to access the PCI
> bus for example.
Could you elaborate? What PCI problems cause an NMI, and on which
motherboards? You obviously have some first-hand experience with the
problem. I'm guessing that you have helped many customers debug their
I think of parity errors being connected to NMI as being an obscure
legacy part of the PC architecture, much like the "A20" line being
switched by the keyboard controller. If the backwards compatibility
broke, no one would notice.
> The last time I got one, it was a bad RAM chip and memtest didn't find
> anything. Try to swap memory with another node to see if the NMIs
> migrate with the chips.
A good point: memory tests often fail to find problems that programs
such as 'gcc' trigger immediately. We compile kernels overnight to test
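
As a rough illustration of the idea (a sketch only, not our actual test
procedure): run a deterministic workload many times and compare each result
against the first pass. On healthy hardware every pass must match, so any
mismatch points at the machine rather than the program. The function names,
buffer size, and use of SHA-256 below are all assumptions chosen for the
example; a real kernel compile exercises far more of the system.

```python
import hashlib
import random


def workload_digest(seed: int, size_bytes: int) -> str:
    """Build a seeded pseudo-random buffer, copy it around in memory,
    and return a SHA-256 digest of the result (hypothetical workload)."""
    rng = random.Random(seed)
    buf = bytearray(rng.getrandbits(8) for _ in range(size_bytes))
    # Make a reversed copy and reverse it back to force extra memory traffic.
    scratch = bytearray(reversed(buf))
    scratch.reverse()
    h = hashlib.sha256()
    h.update(buf)
    h.update(scratch)
    return h.hexdigest()


def count_mismatches(passes: int, seed: int = 1, size_bytes: int = 1 << 16) -> int:
    """Repeat the workload and count passes whose digest differs from the first."""
    reference = workload_digest(seed, size_bytes)
    mismatches = 0
    for _ in range(passes):
        if workload_digest(seed, size_bytes) != reference:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    # Any nonzero count means identical work produced different results.
    print("mismatched passes:", count_mismatches(passes=100))
```

A zero count proves little on its own; the value comes from running a test
like this for hours, preferably with a much larger buffer, and then moving
the suspect memory to another node to see whether the failures follow it.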
Donald Becker becker at scyld.com
Scyld Computing Corporation http://www.scyld.com
410 Severn Ave. Suite 210 Second Generation Beowulf Clusters
Annapolis MD 21403 410-990-9993