[vortex] 3c905-errors

Steven Timm timm@fnal.gov
Mon Oct 6 16:08:00 2003


We have had the following hardware for about a year now
and continue to have mysterious and intermittent network problems on it.

Configuration: Tyan 2466 motherboard with built-in 3C905C-TX NIC, dual
AMD MP2000+ processors, 760MPX chipset, 1 GB RAM.  Typical usage is
fast burst transfer of large files, 1-2 GB each.  Both NIC
and switch are configured to auto-negotiate.

[root@fnd0228 bin]# lspci
00:00.0 Host bridge: Advanced Micro Devices [AMD] AMD-760 MP [IGD4-2P] System Controller (rev 11)
00:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-760 MP [IGD4-2P] AGP Bridge
00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] ISA (rev 05)
00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-768 [Opus] IDE (rev 04)
00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] ACPI (rev 03)
00:10.0 PCI bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] PCI (rev 05)
01:05.0 VGA compatible controller: ATI Technologies Inc Rage Mobility P/M AGP 2x (rev 64)
02:00.0 USB Controller: Advanced Micro Devices [AMD] AMD-768 [Opus] USB (rev 07)
02:08.0 Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado] (rev 78)

lsmod shows:
3c59x                  28488   1

[root@fnd0228 bin]# cat /proc/interrupts
           CPU0       CPU1
  0:   13394439   13385273    IO-APIC-edge  timer
  1:          2          0    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  4:        156        162    IO-APIC-edge  serial
  8:          0          1    IO-APIC-edge  rtc
 14:     110215     110156    IO-APIC-edge  ide0
 15:          6         15    IO-APIC-edge  ide1
 19:   25015189   25013968   IO-APIC-level  eth0
NMI:          0          0
LOC:   26779271   26779266
ERR:          0
MIS:         16

dmesg shows:
02:08.0: 3Com PCI 3c905C Tornado at 0x3000. Vers LK1.1.18-ac
 00:e0:81:23:66:5b, IRQ 19
  product code 0000 rev 00.6 date 00-00-00
  Internal config register is 1800000, transceivers 0xa.
  8K byte-wide RAM 5:3 Rx:Tx split, autoselect/Autonegotiate interface.
  MII transceiver found at address 24, status 782d.
  Enabling bus-master transmits and whole-frame receives.



We have seen three distinct, recurring problems.

One: eth0: Too much work in interrupt, status e401.
This has previously been explained to me as the fault not of the
Scyld driver but of the kernel's interrupt-handling scheme.
These messages are accompanied by an increase in the number of
"receive FIFO" errors shown in /proc/net/dev.  We do not seem to lose
any data when this happens; at least, we have never been able to prove
that we do.
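
For anyone who wants to correlate the log messages with the counters,
here is a minimal sketch of how /proc/net/dev could be snapshotted and
compared.  It assumes the usual 16-column layout (8 receive fields, then
8 transmit fields); the function names and the sample numbers are made
up for illustration and are not taken from our nodes.

```python
# Sketch: parse the text of /proc/net/dev and report which error
# counters went up between two snapshots. Assumes the common layout of
# 8 receive columns followed by 8 transmit columns per interface.

RX_FIELDS = ("bytes", "packets", "errs", "drop",
             "fifo", "frame", "compressed", "multicast")
TX_FIELDS = ("bytes", "packets", "errs", "drop",
             "fifo", "colls", "carrier", "compressed")


def parse_net_dev(text):
    """Return {interface: {"rx_fifo": n, "tx_carrier": n, ...}}."""
    stats = {}
    for line in text.splitlines():
        # Splitting on ':' first also handles "eth0:1234", where the
        # kernel prints no space after the colon.
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # the two header lines contain no colon
        values = rest.split()
        if len(values) != len(RX_FIELDS) + len(TX_FIELDS):
            continue  # unexpected layout, skip rather than guess
        numbers = [int(v) for v in values]
        counters = {"rx_" + k: v for k, v in zip(RX_FIELDS, numbers[:8])}
        counters.update(
            {"tx_" + k: v for k, v in zip(TX_FIELDS, numbers[8:])})
        stats[name.strip()] = counters
    return stats


def counter_increases(before, after, iface,
                      watch=("rx_errs", "rx_fifo", "rx_frame", "tx_carrier")):
    """Return only the watched counters that increased, with the delta."""
    old, new = before[iface], after[iface]
    return {k: new[k] - old[k] for k in watch if new[k] > old[k]}


# Hypothetical snapshots (made-up numbers, for illustration only).
HEADER = (
    "Inter-|   Receive                                                |  Transmit\n"
    " face |bytes    packets errs drop fifo frame compressed multicast"
    "|bytes    packets errs drop fifo colls carrier compressed\n"
)
SAMPLE_BEFORE = HEADER + "  eth0:1000 10 0 0 3 0 0 0 2000 20 0 0 0 0 1 0\n"
SAMPLE_AFTER = HEADER + "  eth0:5000 50 2 0 7 0 0 0 9000 90 0 0 0 0 1 0\n"

if __name__ == "__main__":
    before = parse_net_dev(SAMPLE_BEFORE)
    after = parse_net_dev(SAMPLE_AFTER)
    print(counter_increases(before, after, "eth0"))
```

On a live node the two snapshots would come from reading /proc/net/dev
before and after a transfer, which makes it possible to timestamp
counter jumps against the kernel messages.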

Two:
Oct  5 23:53:33 fnd0139 kernel: eth0: Setting half-duplex based on MII #24 link partner capability of 0000.
Oct  5 23:54:33 fnd0139 kernel: eth0: Setting full-duplex based on MII #24 link partner capability of 41e1.

These messages are associated with increases in the count of
transmit carrier errors in /proc/net/dev.  For about a minute they
cause loss of network connectivity with the node.  The node
usually recovers afterwards and connections proceed as normal.
However, transfers that timed out during the outage are not
necessarily restarted, which does cause trouble.  The link light
of the NIC goes out during this time.  We have seen this problem
on all 240 nodes in a cluster, and have hooked up test nodes
to a variety of network switches.  The problem
happens with all of them.  Interestingly, the switch does
not record any errors in its error counters during these episodes.
We have investigated the physical layer thoroughly and have not found
any problems so far.  Does anyone have an idea what
may be causing this?  In particular, is there any
record that Tyan may have misconfigured this NIC when they put it
on their board?

Note that we also see this error on Tyan 2468 boards, which have the
3c980-TX NIC built in.
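
To make the two capability words in the log easier to read, here is a
small sketch that decodes an MII link partner ability word using the
standard IEEE 802.3 auto-negotiation bit layout.  The duplex rule in
negotiated_duplex() is my understanding of how drivers of this kind
choose duplex, so treat it as an assumption rather than the driver's
actual code; the function names are made up for illustration.

```python
# Sketch: decode an MII auto-negotiation ability word (registers 4 and 5)
# using the standard IEEE 802.3 base-page bit assignments.

ABILITY_BITS = (
    (0x0200, "100baseT4"),
    (0x0100, "100baseTx-FD"),
    (0x0080, "100baseTx"),
    (0x0040, "10baseT-FD"),
    (0x0020, "10baseT"),
)


def decode_ability(word):
    """Split an ability word into its media list and flag bits."""
    return {
        "media": [name for bit, name in ABILITY_BITS if word & bit],
        "ack": bool(word & 0x4000),           # partner acknowledged our page
        "remote_fault": bool(word & 0x2000),
        "selector": word & 0x001F,            # 1 means IEEE 802.3
    }


def negotiated_duplex(advertised, partner):
    """Assumed rule: full duplex if the common modes include 100baseTx-FD,
    or include 10baseT-FD without 100baseTx; otherwise half duplex."""
    common = advertised & partner
    if common & 0x0100 or (common & 0x00C0) == 0x0040:
        return "full"
    return "half"


if __name__ == "__main__":
    # 05e1 is MII register 4 from the mii-diag output in this post;
    # 0000 and 41e1 are the two partner words from the kernel log.
    for partner in (0x0000, 0x41E1):
        print("%04x" % partner,
              decode_ability(partner)["media"],
              negotiated_duplex(0x05E1, partner))
```

A partner word of 0000 decodes to no media at all and no acknowledge
bit, which is what one would expect to read while the link is down,
and fits with the link light going out.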

Three:
Errors of this type:

nfs: server d0bbin-farm not responding, still trying
eth0: transmit timed out, tx_status 00 status e601.
  diagnostics: net 0ccc media 8880 dma 0000003a.
eth0: Interrupt posted but not delivered -- IRQ blocked by another device?
  Flags; bus-master 1, dirty 4116696(8) current 4116696(8)
  Transmit list 00000000 vs. f711f400.
  0: @f711f200  length 8000002a status 0001002a
  1: @f711f240  length 80000032 status 00010032
  2: @f711f280  length 8000002a status 0001002a
  3: @f711f2c0  length 8000002a status 0001002a
  4: @f711f300  length 800000ba status 000100ba
  5: @f711f340  length 80000032 status 00010032
  6: @f711f380  length 8000002a status 8001002a
  7: @f711f3c0  length 80000032 status 80010032
  8: @f711f400  length 800000ba status 000100ba
  9: @f711f440  length 800000ba status 000100ba
  10: @f711f480  length 800000ba status 000100ba
  11: @f711f4c0  length 800000ba status 000100ba
  12: @f711f500  length 800000ba status 000100ba
  13: @f711f540  length 800000ba status 000100ba
  14: @f711f580  length 80000032 status 00010032
  15: @f711f5c0  length 800000ba status 000100ba

eth0: transmit timed out, tx_status 00 status e681.
  diagnostics: net 0ccc media 8880 dma 0000003a.
eth0: Interrupt posted but not delivered -- IRQ blocked by another device?
  Flags; bus-master 1, dirty 4116712(8) current 4116712(8)
  Transmit list 00000000 vs. f711f400.
  0: @f711f200  length 8000002a status 0001002a
  1: @f711f240  length 80000032 status 00010032
  2: @f711f280  length 80000032 status 00010032
  3: @f711f2c0  length 80000032 status 00010032
  4: @f711f300  length 80000032 status 00010032
  5: @f711f340  length 80000032 status 00010032
  6: @f711f380  length 8000002a status 8001002a
  7: @f711f3c0  length 80000032 status 80010032
  8: @f711f400  length 8000002a status 0001002a
  9: @f711f440  length 8000002a status 0001002a
  10: @f711f480  length 8000002a status 0001002a
  11: @f711f4c0  length 80000032 status 00010032
  12: @f711f500  length 80000032 status 00010032
  13: @f711f540  length 80000032 status 00010032
  14: @f711f580  length 8000002a status 0001002a
  15: @f711f5c0  length 80000032 status 00010032
(and so forth).
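
As a quick sanity check on the "Flags;" lines above, here is a short
sketch of the ring arithmetic.  It assumes the number in parentheses is
the counter taken modulo the ring size, and that the ring has 16
entries because the dump lists entries 0 through 15; both are my
reading of the output, and the function name is made up.

```python
# Sketch: check the dirty/current counters printed in the "Flags;" line
# of a transmit timeout dump.
import re

TX_RING_SIZE = 16  # assumption: the dump lists ring entries 0..15

FLAGS_RE = re.compile(r"dirty (\d+)\((\d+)\) current (\d+)\((\d+)\)")


def ring_state(flags_line):
    """Extract the counters and derive the number of pending descriptors."""
    match = FLAGS_RE.search(flags_line)
    if match is None:
        raise ValueError("not a 'Flags;' line: %r" % flags_line)
    dirty, dirty_idx, current, current_idx = (int(g) for g in match.groups())
    return {
        "dirty": dirty,
        "current": current,
        # Descriptors queued but not yet reclaimed, by these counters.
        "pending": current - dirty,
        # True if the printed indexes equal the counters modulo ring size.
        "index_matches": (dirty % TX_RING_SIZE == dirty_idx
                          and current % TX_RING_SIZE == current_idx),
    }


# The two "Flags;" lines from the dumps in this post.
FIRST = "  Flags; bus-master 1, dirty 4116696(8) current 4116696(8)"
SECOND = "  Flags; bus-master 1, dirty 4116712(8) current 4116712(8)"

if __name__ == "__main__":
    first, second = ring_state(FIRST), ring_state(SECOND)
    print(first)
    print(second)
    print("frames queued between dumps:", second["current"] - first["current"])
```

By these counters nothing was pending at the moment of either dump, and
exactly one ring's worth of frames (16) was queued between the two
timeouts.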

On the machine that gave this error:
[root@fnd0228 bin]# ./mii-diag
Using the default interface 'eth0'.
Basic registers of MII PHY #24:  3000 782d 0041 6800 05e1 41e1 0007 2801.
 The autonegotiated capability is 01e0.
The autonegotiated media type is 100baseTx-FD.
 Basic mode control register 0x3000: Auto-negotiation enabled.
 You have link beat, and everything is working OK.
 Your link partner advertised 41e1: 100baseTx-FD 100baseTx 10baseT-FD 10baseT.
   End of basic transceiver information.

[root@fnd0228 bin]# ./vortex-diag -a
vortex-diag.c:v2.14 12/28/2002 Donald Becker (becker@scyld.com)
 http://www.scyld.com/diag/index.html
Index #1: Found a 3c905C Tornado 100baseTx adapter at 0x3000.
 Station address 00:e0:81:23:66:5b.
  Receive mode is 0x07: Normal unicast and all multicast.
The Vortex chip may be active, so FIFO registers will not be read.
To see all register values use the '-f' flag.
Initial window 4, registers values by window:
  Window 0: 0000 0000 0000 0000 adad 00bf ffff 0000.
  Window 1: FIFO FIFO 0700 0000 0000 007f 0000 2201.
  Window 2: e000 2381 5b66 0000 0000 0000 0052 4000.
  Window 3: 0000 0180 05ea 0020 000a 07b8 0728 6601.
  Window 4: 0000 0000 0000 0ecc 0001 9880 0000 8201.
  Window 5: 1ffc 0000 0000 0600 0807 06ce 06c6 a000.
  Window 6: 0000 0000 0000 8400 0000 2194 050c c201.
  Window 7: 0000 0000 0000 0000 0000 0000 0000 e401.
Vortex chip registers at 0x3000
  0x3010: **FIFO** 00000000 00000028 *STATUS*
  0x3020: 00000020 00000000 00080000 00000004
  0x3030: 00000000 fc6b0395 35a480f0 00080004
  0x3040: 0018e7c8 00000000 000000b7 00000000
  0x3050: 00000000 00000000 00000000 00000000
  0x3060: 00000000 00000000 00000000 00000000
  0x3070: 00009000 00000000 01600000 00000000
  DMA control register is 00000032.
   Tx list starts at 00000000.
   Tx FIFO thresholds: min. burst 256 bytes, priority with 128 bytes to empty.
   Rx FIFO thresholds: min. burst 256 bytes, priority with 128 bytes to full.
   Poll period Tx 00 ns.,  Rx 0 ns.
   Maximum burst recorded Tx 0,  Rx 352.
 Indication enable is 06c6, interrupt enable is 06ce.
 No interrupt sources are pending.
 Transceiver/media interfaces available:  100baseTx 10baseT.
Transceiver type in use:  Autonegotiate.
 MAC settings: full-duplex.
 Station address set to 00:e0:81:23:66:5b.
 Configuration options 0052.
[root@fnd0228 bin]# ./vortex-diag -e
vortex-diag.c:v2.14 12/28/2002 Donald Becker (becker@scyld.com)
 http://www.scyld.com/diag/index.html
Index #1: Found a 3c905C Tornado 100baseTx adapter at 0x3000.
 Station address 00:e0:81:23:66:5b.
  Receive mode is 0x07: Normal unicast and all multicast.
Saved EEPROM settings of a 3Com Vortex/Boomerang:
 3Com Node Address 00:E0:81:23:66:5B (used as a unique ID only).
 OEM Station address 00:E0:81:23:66:5B (used as the ethernet address).
  Device ID 9200,  Manufacturer ID 6d50.
  Manufacture date (MM/DD/YYYY) 0/0/2000, division , product .
  No BIOS ROM is present.
 Transceiver selection: Autonegotiate.
   Options: negotiated duplex, link beat required.
 PCI Subsystem IDs: Vendor 10f1 Device 2466.
 100baseTx 10baseT.
  Vortex format checksum is incorrect (29 vs. 10f1).
  Cyclone format checksum is incorrect (0xb5 vs. 0xff).
  Hurricane format checksum is incorrect (0x68 vs. 0xff).
[root@fnd0228 bin]# ./vortex-diag -m
vortex-diag.c:v2.14 12/28/2002 Donald Becker (becker@scyld.com)
 http://www.scyld.com/diag/index.html
Index #1: Found a 3c905C Tornado 100baseTx adapter at 0x3000.
 Station address 00:e0:81:23:66:5b.
  Receive mode is 0x07: Normal unicast and all multicast.
 MII PHY found at address 1, status 0024.
 MII PHY found at address 2, status 0024.
 MII PHY found at address 3, status 0024.
 MII PHY found at address 4, status 0024.
 MII PHY 0 at #1 transceiver registers:
   0000 0024 0000 0000 01e0 41e1 0003 0800
   0000 0000 0000 0000 0000 0000 0000 0000
   0600 c711 0000 4000 0000 0000 0000 0000
   0000 0400 0000 0000 0000 0ae8 0000 0000.
 MII PHY 1 at #2 transceiver registers:
   0000 0024 0000 0000 01e0 41e1 0003 0800
   0000 0000 0000 0000 0000 0000 0000 0000
   0600 c711 0000 4000 0000 0000 0000 0000
   0000 0400 0000 0000 0000 0ae8 0000 0000.
 MII PHY 2 at #3 transceiver registers:
   0000 0024 0000 0000 01e0 41e1 0003 0800
   0000 0000 0000 0000 0000 0000 0000 0000
   0600 c711 0000 4000 0000 0000 0000 0000
   0000 0400 0000 0000 0000 0ae8 0000 0000.
 MII PHY 3 at #4 transceiver registers:
   0000 0024 0000 0000 01e0 41e1 0003 0800
   0000 0000 0000 0000 0000 0000 0000 0000
   0600 c711 0000 4000 0000 0000 0000 0000
   0000 0400 0000 0000 0000 0ae8 0000 0000.





The transmit timeouts described under problem three almost always cause
the machine to crash and/or become unpingable.
They are associated with increases in the "errs" and "frame"
counters in /proc/net/dev.  However, these counters sometimes
also increase without the above error message.

Replacing the on-board NIC with a NIC card of a different brand
has allowed the affected nodes, only a few out of the 240,
to operate without errors.  Replacing the motherboard also fixes
this problem.  Is there anything short of these
two solutions that could be done to stop this from happening?
I saw errors like this when we first got the nodes and the duplex
settings weren't set correctly, but now everything is set to
autonegotiate, so I don't think that is the problem.

We have actually thought about buying different network cards
to put into these machines, but are reluctant to do so because
we believe this would interfere with being able to PXE-boot
these machines.

Any help would be appreciated.

Steven Timm
------------------------------------------------------------------
Steven C. Timm (630) 840-8525  timm@fnal.gov  http://home.fnal.gov/~timm/
Fermilab Computing Division/Core Support Services Dept.
Assistant Group Leader, Scientific Computing Support Group
Lead of Computing Farms Team