[eepro100] Dell 4400 instability with eepro100 driver...

Donald Becker becker@scyld.com
Mon Feb 18 15:27:00 2002


On Sat, 16 Feb 2002, Henrik Schmiediche wrote:

> I have a single processor Dell 4400 server with 4GB of RAM that I cannot get
> to run stable under high network loads (NFS, remote backups).

This sounds like a hardware problem, and not a common type of problem.

> I am about ready to trash this system and go back to a Sun.

Don't imagine that Suns don't have obscure hardware problems as well.

[[ Various failures deleted.  ]]

> ...with no success. When I installed the
> latest eepro100 drivers I get this NMI message which may be related to the
> lockups, but I am not sure... I have tried changing RAM with no success.

Yup, it's almost certainly related to the failures.

> Feb 16 07:58:38 s0 kernel: eepro100.c:v1.20 1/28/2002 Donald Becker
> <becker@scyld.com>
> Feb 16 07:58:38 s0 kernel:   http://www.scyld.com/network/eepro100.html
> Feb 16 07:58:38 s0 kernel: Uhhuh. NMI received. Dazed and confused, but
> trying to continue
> Feb 16 07:58:38 s0 kernel: You probably have a hardware problem with your
> RAM chips

This message might be a little misleading.  Given that you get the NMI
message just when the driver is accessing the NIC over the PCI bus, my
guess is that you are seeing PCI bus problems.  These are signaled over the
NMI interrupt, similar to other detected data transfer errors.

The most common reason for an NMI is a memory parity error.

> Feb 16 07:58:38 s0 kernel: Uhhuh. NMI received. Dazed and confused, but
> trying to continue
> Feb 16 07:58:38 s0 kernel: You probably have a hardware problem with your
> RAM chips
> Feb 16 07:58:38 s0 kernel: Uhhuh. NMI received for unknown reason 25.

That number '25' is the key to understanding how your machine is broken.
My guess is that you are getting PCI bus address or data parity errors.
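
A rough way to read that number, as a sketch only: it assumes the value
the kernel prints is the hex byte read from I/O port 0x61, as in the stock
i386 NMI handler, where bit 7 (0x80) flags a memory parity or system error
and bit 6 (0x40) flags an I/O channel check.  The function and constant
names below are made up for illustration.

```python
# Sketch only: assumes the logged value is the hex byte read from port 0x61.
NMI_PARITY_OR_SERR = 0x80  # bit 7: memory parity error / system error
NMI_IO_CHECK = 0x40        # bit 6: I/O channel check


def decode_nmi_reason(hex_text):
    """Return the error causes flagged in a logged NMI reason byte."""
    reason = int(hex_text, 16)
    causes = []
    if reason & NMI_PARITY_OR_SERR:
        causes.append("memory parity or system error")
    if reason & NMI_IO_CHECK:
        causes.append("I/O channel check")
    # With neither bit set the kernel has nothing to blame, hence "unknown".
    return causes or ["unknown (neither error bit set)"]


print(decode_nmi_reason("25"))
```

For 0x25 neither error bit is set, which is why the kernel calls the
reason "unknown" rather than blaming the RAM a third time.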

> Feb 16 07:58:38 s0 kernel: Dazed and confused, but trying to continue
> Feb 16 07:58:38 s0 kernel: Do you have a strange power saving mode enabled?

Here is where the kernel gives up reporting further errors to avoid
filling the log.

> Feb 16 07:58:38 s0 kernel: eth0: Intel i82559 rev 8 at 0xf899f000,
> 00:B0:D0:20:87:60, IRQ 14.
> Feb 16 07:58:38 s0 kernel:   Board assembly 07195d-000, Physical connectors
> present: RJ45
> Feb 16 07:58:38 s0 kernel:   Primary interface chip i82555 PHY #1.
> Feb 16 07:58:38 s0 kernel:   General self-test: passed.
> Feb 16 07:58:38 s0 kernel:   Serial sub-system self-test: passed.
> Feb 16 07:58:38 s0 kernel:   Internal registers self-test: passed.
> Feb 16 07:58:38 s0 kernel:   ROM checksum self-test: passed (0x04f4518b).

All tests passed.  This hints that the errors are occurring when the NIC
is a PCI target, not a PCI master.

> The error message I get (a whole lot of them):
> 
> Feb 15 23:35:22 s0 kernel: Command 0080 was not immediately accepted, 10001
> ticks!

...but I could be wrong about that.

>    - The eepro100  card shares an interrupt with the SCSI controller. Is
> there a way to reassign the IRQ of the eepro100 card?

Perhaps, in the BIOS or by physically moving the card.  But that's unlikely
to be the problem.

>    - The system is even more unstable when I install a second CPU.

Yup.  Could be errors on the memory coherency traffic.

>  Any ideas on what to try? Bad motherboard?

Yes, likely a bad motherboard.

> NMI:          3

Hmmm, I expect that this count increases over time.  I would track down
the exact access that triggers the NMI.  But then again, I can pretend
that I'm doing that to write better, more informative error messages and
diagnostics.  (In reality I just like making things work, even when it
doesn't make economic sense.)
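
A minimal sketch for checking whether the count climbs under load.  It
assumes the "NMI:" line comes from /proc/interrupts and has one numeric
column per CPU; the helper name nmi_count is hypothetical.

```python
def nmi_count(interrupts_text):
    """Sum the per-CPU NMI counters from /proc/interrupts-style text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            # Keep only numeric columns; some kernels append a description.
            return sum(int(f) for f in fields[1:] if f.isdigit())
    return None  # no NMI line found


sample = "           CPU0\n  0:    1234567          XT-PIC  timer\nNMI:          3\n"
print(nmi_count(sample))
```

Feed it the contents of /proc/interrupts before and after a heavy NFS or
backup run; a difference between the two readings ties the NMIs to the
network load.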

You should just replace the hardware.


> [root@s0:/var/log]# mii-diag
> Using the default interface 'eth0'.
> Basic registers of MII PHY #1:  3000 782d 02a8 0154 05e1 41e1 0003 0000.

Thanks for remembering the driver detection message and diagnostic
information.  This wasn't needed here, but it is for most problems.

Donald Becker				becker@scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993