[vortex] 3c905-errors
Donald Becker
becker@scyld.com
Mon Oct 6 17:55:01 2003
On Mon, 6 Oct 2003, Steven Timm wrote:
> We have had the following hardware for about a year now
> and continue to have mysterious and intermittent network problems on it.
> Configuration: Tyan 2466 motherboard, built-in NIC: 3C905TX., dual
> AMD MP2000+ processors, 760MPX chipset. 1 GB RAM. Typical mode of
> usage is fast burst transfer of large files, 1-2 GB. Both NIC
> and switch are configured to auto-negotiate.
> 01:05.0 VGA compatible controller: ATI Technologies Inc Rage Mobility P/M AGP 2x (rev 64)
Are you running the video controller in graphics mode or text mode?
(Yes, it _might_ make a difference. Some video controllers are notorious
for hogging bus bandwidth and violating bus-hold-time specs.)
> 02:08.0 Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado] (rev 78)
Important detail: no SCSI controllers.
> dmesg shows:
> 02:08.0: 3Com PCI 3c905C Tornado at 0x3000. Vers LK1.1.18-ac
> 00:e0:81:23:66:5b, IRQ 19
> product code 0000 rev 00.6 date 00-00-00
Ehhhh, this makes me suspicious that other vital parameters from the
EEPROM are bogus. Specifically the settings that tell the BIOS the PCI
burst bandwidth requirements.
> We have seen three different and repeating problems.
>
> One: eth0: Too much work in interrupt, status e401.
> This has previously been explained to me as not being the fault of the
> scyld driver but of the kernel interrupt-handling scheme.
Typically it's other device drivers in the kernel. You can reduce or turn
off this message, but that doesn't "fix" it. The message is reporting a
real problem.
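
The status word in that message can be unpacked bit by bit. A minimal
sketch, assuming the bit names from the vortex_status enum in the Linux
3c59x driver source and assuming the top three bits hold the current
register window; both should be checked against the driver you are
actually running. decode_status() is a made-up helper name.

```python
# Bit names assumed from the vortex_status enum in the Linux 3c59x driver;
# verify against your own driver source before relying on them.
STATUS_BITS = {
    0x0001: "IntLatch",
    0x0002: "HostError",
    0x0004: "TxComplete",
    0x0008: "TxAvailable",
    0x0010: "RxComplete",
    0x0020: "RxEarly",
    0x0040: "IntReq",
    0x0080: "StatsFull",
    0x0100: "DMADone",
    0x0200: "DownComplete",
    0x0400: "UpComplete",
    0x0800: "DMAInProgress",
    0x1000: "CmdInProgress",
}


def decode_status(status):
    """Return (register_window, names of the status bits that are set)."""
    window = (status >> 13) & 0x7  # assumed: bits 15..13 are the window number
    names = [name for bit, name in sorted(STATUS_BITS.items()) if status & bit]
    return window, names


if __name__ == "__main__":
    # e401 is from the "Too much work" message, e601 from the Tx timeout.
    for value in (0xE401, 0xE601):
        window, names = decode_status(value)
        print("%04x: window %d, %s" % (value, window, ", ".join(names)))
```

If those names are right, e401 is window 7 with IntLatch and UpComplete
set, i.e. an interrupt is latched and a receive (upload) has completed.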
> Two:
> Oct 5 23:53:33 fnd0139 kernel: eth0: Setting half-duplex based on MII #24 link partner capability of 0000.
> Oct 5 23:54:33 fnd0139 kernel: eth0: Setting full-duplex based on MII #24 link partner capability of 41e1.
>
> These errors are associated with increases in the count of
> transmit carrier errors in /proc/net/dev. They do, for about a minute,
> lead to loss of network connectivity with the node.
Yup, they would. They seem to be reporting a loss of link beat.
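
The two capability words in those log lines can be decoded by hand. A
minimal sketch, assuming the driver prints the raw MII link partner
ability register (register 5) with the standard IEEE 802.3 bit layout;
the table and function names are made up for illustration.

```python
# Standard bit layout of MII register 5 (link partner ability).
LPA_BITS = [
    (0x8000, "next-page"),
    (0x4000, "acknowledge"),
    (0x2000, "remote-fault"),
    (0x0400, "pause"),
    (0x0200, "100baseT4"),
    (0x0100, "100baseTx-FD"),
    (0x0080, "100baseTx"),
    (0x0040, "10baseT-FD"),
    (0x0020, "10baseT"),
]


def decode_link_partner(lpa):
    """Return (selector_field, names of the ability bits that are set)."""
    selector = lpa & 0x001F  # 1 means IEEE 802.3
    names = [name for bit, name in LPA_BITS if lpa & bit]
    return selector, names


if __name__ == "__main__":
    for value in (0x0000, 0x41E1):
        selector, names = decode_link_partner(value)
        print("%04x: selector %d, %s" % (value, selector, ", ".join(names) or "nothing"))
```

41e1 is a normal 10/100 partner advertising both speeds at both
duplexes. 0000 means no abilities were received at all, which is what
you would expect to read while the link is down.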
> necessarily restarted and it does cause trouble. The link light
> of the NIC goes out during this time.
That's a vital detail.
The link LEDs going out strongly indicates that it's not the driver
having a problem reading the status register, but rather the switch
dropping the link.
This can be verified by running 'mii-diag' while the link is down.
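
To catch the moment the link drops, the transmit carrier error count
mentioned above can be watched. A minimal sketch, assuming the usual
/proc/net/dev layout of eight receive counters followed by eight
transmit counters per interface; carrier_errors() is a made-up helper
that parses text you have already read from that file.

```python
def carrier_errors(proc_net_dev_text, ifname):
    """Return the transmit carrier error count for ifname.

    Assumes 16 counters per interface: 8 receive, then 8 transmit
    (bytes packets errs drop fifo colls carrier compressed).
    """
    for line in proc_net_dev_text.splitlines():
        name, sep, counters = line.partition(":")
        if not sep or name.strip() != ifname:
            continue  # header line or a different interface
        fields = counters.split()
        if len(fields) < 16:
            raise ValueError("unexpected /proc/net/dev format")
        return int(fields[8 + 6])  # 7th transmit counter is carrier
    raise KeyError(ifname)
```

Calling it every few seconds on the contents of /proc/net/dev and
logging the time whenever the count changes gives timestamps that can
be compared against the switch's own logs.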
> We have seen this problem
> happen on all of 240 different nodes in a cluster,
Simultaneously, or a random pattern?
> The problem
> happens with all of them. Interestingly enough, the switch does
> not record any errors in its error counters during these episodes.
Hmmm, it definitely should record a re-negotiation.
> may be causing this problem..in particular is there any
> record that Tyan may have misconfigured this NIC when they put it
> on their board?
The bogus EEPROM sections hint that they didn't take any special care
with the implementation.
> Errors of this type:
> nfs: server d0bbin-farm not responding, still trying
> eth0: transmit timed out, tx_status 00 status e601.
> diagnostics: net 0ccc media 8880 dma 0000003a.
> eth0: Interrupt posted but not delivered -- IRQ blocked by another device?
This is a very different case.
It is likely a problem with the kernel's IRQ handling.
The device driver is reporting that the hardware raised an interrupt,
yet the interrupt handler was never called.
> On the machine that gave this error:
> [root@fnd0228 bin]# ./mii-diag
This isn't a transceiver problem. Check /proc/interrupts to see if the
IRQ count increases.
A work-around for the underlying bug might be to pass 'noapic' as a
kernel option.
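
The /proc/interrupts check can be scripted. A minimal sketch, assuming
the usual layout of an "IRQ:" label followed by one count per CPU;
irq_count() and irq_is_advancing() are made-up helper names, and 19 is
the IRQ from the dmesg output quoted earlier.

```python
def irq_count(proc_interrupts_text, irq):
    """Return the interrupt count for irq, summed over all CPUs."""
    for line in proc_interrupts_text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep or label.strip() != str(irq):
            continue  # CPU header line or a different IRQ
        total = 0
        for field in rest.split():
            if not field.isdigit():
                break  # reached the controller type / device name
            total += int(field)
        return total
    raise KeyError(irq)


def irq_is_advancing(before_text, after_text, irq):
    """True if the count for irq grew between two snapshots of the file."""
    return irq_count(after_text, irq) > irq_count(before_text, irq)
```

Take one snapshot of /proc/interrupts, push some traffic through eth0,
take a second snapshot, and compare. If the count for the NIC's IRQ
does not move while the driver reports interrupts posted, the interrupt
is being lost before it reaches the handler.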
> [root@fnd0228 bin]# ./vortex-diag -a
> vortex-diag.c:v2.14 12/28/2002 Donald Becker (becker@scyld.com)
...
> Tx FIFO thresholds: min. burst 256 bytes, priority with 128 bytes to empty.
> Rx FIFO thresholds: min. burst 256 bytes, priority with 128 bytes to full.
...
> Maximum burst recorded Tx 0, Rx 352.
Ehhh, that's a rather short burst. The chip is capable of transferring
a whole 1500 byte packet to memory in a single burst. On PCI 2.1, the
bus parameters should be set up to allow this.
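
To put numbers on that, here is the arithmetic as a short sketch. It
assumes a 32-bit PCI bus moving 4 bytes per data phase and uses the
1500 byte packet and 352 byte maximum burst from above; the helper
names are made up.

```python
PACKET_BYTES = 1500         # packet size mentioned above
OBSERVED_MAX_BURST = 352    # "Maximum burst recorded ... Rx 352"
BYTES_PER_DATA_PHASE = 4    # assumption: 32-bit PCI


def ceil_div(a, b):
    """Integer division rounded up."""
    return (a + b - 1) // b


def bursts_needed(packet_bytes, max_burst_bytes):
    """How many separate bus bursts it takes to move one packet."""
    return ceil_div(packet_bytes, max_burst_bytes)


def data_phases(nbytes, bytes_per_phase=BYTES_PER_DATA_PHASE):
    """How many PCI data phases nbytes occupies."""
    return ceil_div(nbytes, bytes_per_phase)


if __name__ == "__main__":
    print("bursts per packet:", bursts_needed(PACKET_BYTES, OBSERVED_MAX_BURST))
    print("data phases in longest burst:", data_phases(OBSERVED_MAX_BURST))
    print("data phases for a whole packet:", data_phases(PACKET_BYTES))
```

At 352 bytes per burst a full-size packet takes at least five separate
bus acquisitions instead of one, and each one is another chance for
some other device to get in the way.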
> [root@fnd0228 bin]# ./vortex-diag -e
What does '-eee' (show all details) report?
> Saved EEPROM settings of a 3Com Vortex/Boomerang:
> 3Com Node Address 00:E0:81:23:66:5B (used as a unique ID only).
> OEM Station address 00:E0:81:23:66:5B (used as the ethernet address).
> Device ID 9200, Manufacturer ID 6d50.
> Manufacture date (MM/DD/YYYY) 0/0/2000, division , product .
> No BIOS ROM is present.
> Transceiver selection: Autonegotiate.
> Options: negotiated duplex, link beat required.
> PCI Subsystem IDs: Vendor 10f1 Device 2466.
> 100baseTx 10baseT.
> Vortex format checksum is incorrect (29 vs. 10f1).
> Cyclone format checksum is incorrect (0xb5 vs. 0xff).
> Hurricane format checksum is incorrect (0x68 vs. 0xff).
They didn't get the checksum right.
> to operate without errors. Replacing the motherboard also fixes
> this problem. Is there something short of either of these
> two solutions which could be done to stop this from happening?
...
> We have actually thought about buying different network cards
> to put into these machines, but are reluctant to do so because
> we believe this would interfere with being able to PXE-boot
> these machines.
If you decide to do this, buy network adapters with their own PXE boot
ROMs. The motherboard BIOS PXE ROM will not work, even with other
3c905c cards.
--
Donald Becker becker@scyld.com
Scyld Computing Corporation http://www.scyld.com
914 Bay Ridge Road, Suite 220 Scyld Beowulf cluster system
Annapolis MD 21403 410-990-9993