[vortex] 3c59x LK1.1.16 Linux-2.4 PCI bus error/Host error

epl@labyrinth.net.au
Sat Jul 13 11:33:01 2002


   Using a 3Com 3c905B 100BaseTX NIC with the 3c59x driver, I can make
the module lock up by transferring large amounts of data over the
network.

   Bringing the network down stops the flow of messages to syslog (see
below). If I restart the network without unloading the module, the
network stays locked up and the messages resume. If I stop the
network, unload the module and restart the network, I can use the
network normally until the next lockup.

   With debug=7, I get the following error messages in syslog:
===
Jul 10 23:32:55 localhost kernel: <7tatus e001
Jul 10 23:32:56 localhost kernel: eth0: vortex_error(), status=0xe081
Jul 10 23:33:02 localhost kernel: eth0: vortex_error(), status=0xe081
Jul 10 23:33:05 localhost kernel: <7ake queue
Jul 10 23:33:05 localhost kernel: eth0: vortex_error(), status=0xe081
Jul 10 23:33:28 localhost last message repeated 4 times
Jul 10 23:33:31 localhost kernel: e401.
Jul 10 23:33:31 localhost kernel: eth0: vortex_error(), status=0xe081
Jul 10 23:33:47 localhost last message repeated 5 times
Jul 10 23:33:48 localhost kernel: eth0: vortex_error(), status=0xe003
Jul 10 23:33:48 localhost kernel: eth0: Host error, FIFO diagnostic register 0000.
Jul 10 23:33:48 localhost kernel: eth0: PCI bus error, bus status 80000020
Jul 10 23:33:48 localhost kernel: eth0: using NWAY device table, not 8
Jul 10 23:33:48 localhost kernel: eth0: MII #0 status 0080, link partner capability 0080, info1 0010, setting half-duplex.
Jul 10 23:33:48 localhost kernel: eth0: vortex_error(), status=0xe003
Jul 10 23:33:48 localhost kernel: eth0: Host error, FIFO diagnostic register 0000.
Jul 10 23:33:48 localhost kernel: eth0: PCI bus error, bus status 80000020
Jul 10 23:33:48 localhost kernel: eth0: using NWAY device table, not 8
Jul 10 23:33:48 localhost kernel: eth0: MII #0 status 0080, link partner capability 0080, info1 0010, setting half-duplex.
<snipped -- continues until network is brought down>
===
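   To make the status words above easier to read, here is a small
decoder sketch. It assumes the bit layout of the vortex_status enum in
the 2.4-series 3c59x.c and the EtherLink III convention that bits 15:13
hold the currently selected register window. VORTEX_STATUS_BITS and
decode_status are hypothetical names of my own, not part of the driver.

```python
# Decode a 3c59x status word as printed by "vortex_error(), status=0x....".
# ASSUMPTION: bits 0-12 follow the vortex_status enum in the 2.4 3c59x.c,
# and bits 15:13 report the currently selected register window.

VORTEX_STATUS_BITS = {
    0x0001: "IntLatch",
    0x0002: "HostError",
    0x0004: "TxComplete",
    0x0008: "TxAvailable",
    0x0010: "RxComplete",
    0x0020: "RxEarly",
    0x0040: "IntReq",
    0x0080: "StatsFull",
    0x0100: "DMADone",
    0x0200: "DownComplete",
    0x0400: "UpComplete",
    0x0800: "DMAInProgress",
    0x1000: "CmdInProgress",
}


def decode_status(status):
    """Split a status word into (register window, list of set flag names)."""
    window = (status >> 13) & 0x7  # bits 15:13 = current register window
    flags = [name for bit, name in sorted(VORTEX_STATUS_BITS.items())
             if status & bit]
    return window, flags


if __name__ == "__main__":
    # The two status words that appear intact in the syslog excerpt.
    for word in (0xE081, 0xE003):
        window, flags = decode_status(word)
        print(f"0x{word:04x}: window {window}, {' | '.join(flags)}")
```

Read this way, both words are reported with window 7 selected. 0xe081
flags a statistics-full event, while 0xe003 flags HostError, which is
the condition that precedes each "Host error ... PCI bus error" message
in the log.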

   I haven't found a reliable test case that triggers the above bug
every time. However, scp'ing a large directory from a remote machine
(and repeating if necessary) seems to trigger it.

   I also suspect that this problem causes the occasional kernel oops,
although I haven't been able to confirm the connection.

   The hardware is an Intel Pentium-100 with a built-in EIDE controller
and S3 video. The only add-on card is a 3c905B 100BaseTX PCI card,
running at 10 Mbps and connected via a hub.

   I have seen this bug with both the Red Hat-supplied 2.4.9-31 kernel
and a custom-compiled 2.4.18 kernel. The captures in this e-mail are
all from the 2.4.18 kernel, with the module loaded using the modprobe
option "debug=7". The rest of the system is Red Hat 7.2 with the
occasional update applied.

   Looking around the web, others appear to have encountered similar
behaviour, but no-one seems to have tracked it down well enough to
fix it. For what it's worth, I should note that:
- Bus-mastering is on. Turning it off might help, but I don't know how.
- I am not using SMP. Neither the kernel nor the hardware supports it.
- I have removed every other add-on card. The EIDE controllers and
  video are both built-in.
- The machine has two PCI slots. I have reproduced this bug in both.
- Replacing the 3c905B with an NE2000 PCI NIC eliminates the bug. The
  NE2000 card probably doesn't use bus mastering.
- With only the modprobe option "debug=7", the card runs half-duplex.
  Adding "options=512" turns full duplex on. The bug is reproducible
  with either duplex setting.
- On another system (Athlon-class) with the same NIC model (3c905B
  Cyclone 100baseTx) and essentially the same software (Red Hat 7.2),
  no problems occur.
- I therefore physically swapped the 3c905B NICs between the two
  systems, to rule out a single faulty card. The Pentium-class system
  continued to have problems and the Athlon-class system continued to
  be trouble-free.

   Below is some debugging information that may be of use. I know this
is a hard bug to track down and I'd be glad to perform any additional
debugging required. While "I Am Not A Kernel Hacker", I'll give even
difficult debugging a go. However, some earlier e-mails indicate that
this is a hardware fault. If so, can someone clarify whether the fault
lies with the NIC, the PC, or both?

Output of ``lspci -vx'' as root:
===
00:00.0 Host bridge: Intel Corporation 430FX - 82437FX TSC [Triton I] (rev 02)
	Flags: bus master, medium devsel, latency 64
00: 86 80 2d 12 06 00 00 22 02 00 00 06 00 40 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

00:07.0 ISA bridge: Intel Corporation 82371FB PIIX ISA [Triton I] (rev 02)
	Flags: bus master, medium devsel, latency 0
00: 86 80 2e 12 0f 00 80 02 02 00 01 06 00 00 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

00:08.0 VGA compatible controller: S3 Inc. 86c764/765 [Trio32/64/64V+] (prog-if 00 [VGA])
	Flags: VGA palette snoop, medium devsel, IRQ 3
	Memory at fe000000 (32-bit, non-prefetchable) [size=8M]
	Expansion ROM at <unassigned> [disabled] [size=64K]
00: 33 53 11 88 23 00 00 02 00 00 00 03 00 00 00 00
10: 00 00 00 fe 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 03 01 00 00

00:13.0 Ethernet controller: 3Com Corporation 3c905B 100BaseTX [Cyclone] (rev 24)
	Subsystem: 3Com Corporation 3C905B Fast Etherlink XL 10/100
	Flags: bus master, medium devsel, latency 64, IRQ 11
	I/O ports at fc80 [size=128]
	Memory at fffbfc00 (32-bit, non-prefetchable) [size=128]
	Expansion ROM at <unassigned> [disabled] [size=128K]
	Capabilities: [dc] Power Management version 1
00: b7 10 55 90 17 01 10 22 24 00 00 02 08 40 00 00
10: 81 fc 00 00 00 fc fb ff 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 b7 10 55 90
30: 00 00 00 00 dc 00 00 00 00 00 00 00 0b 01 0a 0a
===
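   The config-space dump can also be checked by hand. The sketch below
decodes the standard PCI Command (offset 0x04) and Status (offset 0x06)
registers from the first "lspci -x" line of a device, using the bit
definitions of the PCI Local Bus Specification. parse_config_line and
decode_command_status are hypothetical helpers, not part of lspci or
pciutils.

```python
# Decode the PCI Command and Status registers from an "lspci -x" dump line.
# Bit meanings follow the PCI Local Bus Specification (type 0 header).

PCI_COMMAND_BITS = {
    0: "I/O space",
    1: "Memory space",
    2: "Bus master",
    3: "Special cycles",
    4: "Memory write and invalidate",
    5: "VGA palette snoop",
    6: "Parity error response",
    8: "SERR# enable",
    9: "Fast back-to-back enable",
}

PCI_STATUS_BITS = {
    4: "Capabilities list",
    5: "66 MHz capable",
    7: "Fast back-to-back capable",
    8: "Master data parity error",
    11: "Signaled target abort",
    12: "Received target abort",
    13: "Received master abort",
    14: "Signaled system error",
    15: "Detected parity error",
}

# Status bits 10:9 encode the device's DEVSEL timing.
DEVSEL_TIMING = {0: "fast", 1: "medium", 2: "slow"}


def parse_config_line(line):
    """Turn an lspci -x line such as '00: b7 10 ...' into a list of ints."""
    _offset, hex_bytes = line.split(":", 1)
    return [int(b, 16) for b in hex_bytes.split()]


def decode_command_status(first_line):
    """Decode Command/Status from the offset-00 line of a config dump."""
    cfg = parse_config_line(first_line)
    if len(cfg) < 8:
        raise ValueError("need at least 8 bytes of config space")
    # Both registers are 16-bit little-endian values.
    command = cfg[0x04] | (cfg[0x05] << 8)
    status = cfg[0x06] | (cfg[0x07] << 8)
    cmd_flags = [n for bit, n in PCI_COMMAND_BITS.items() if command & (1 << bit)]
    sts_flags = [n for bit, n in PCI_STATUS_BITS.items() if status & (1 << bit)]
    devsel = DEVSEL_TIMING.get((status >> 9) & 0x3, "reserved")
    return command, status, cmd_flags, sts_flags, devsel


if __name__ == "__main__":
    # Offset-00 line of the 3c905B (00:13.0) from the lspci -vx output.
    nic = "00: b7 10 55 90 17 01 10 22 24 00 00 02 08 40 00 00"
    command, status, cmd_flags, sts_flags, devsel = decode_command_status(nic)
    print(f"Command 0x{command:04x}: {', '.join(cmd_flags)}")
    print(f"Status  0x{status:04x}: {', '.join(sts_flags)}; DEVSEL {devsel}")
```

For the 3c905B this gives Command 0x0117 (I/O space, memory space, bus
master, memory write and invalidate, SERR# enable) and Status 0x2210,
with DEVSEL medium and bit 13, Received Master Abort, set. Per the PCI
specification a device sets that bit when a transaction it initiated as
bus master ends in a master abort, which would fit the "PCI bus error"
reports above if this dump was taken after a lockup.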

debug=7 output upon NIC initialisation
===
3c59x: Donald Becker and others. www.scyld.com/network/vortex.html
See Documentation/networking/vortex.txt
00:13.0: 3Com PCI 3c905B Cyclone 100baseTx at 0xfc80. Vers LK1.1.16
 00:10:4b:0a:19:95, IRQ 11
  product code 4e47 rev 00.9 date 04-21-98
  8K byte-wide RAM 5:3 Rx:Tx split, autoselect/10baseT interface.
  Enabling bus-master transmits and whole-frame receives.
00:13.0: scatter/gather enabled. h/w checksums enabled
eth0: using NWAY device table, not 0
eth0: MII #0 status 0080, link partner capability 0080, info1 0010, setting half-duplex.
===

Thanks
Eddie