[vortex] 3c905-errors

Mon Oct 6 17:55:01 2003

On Mon, 6 Oct 2003, Steven Timm wrote:

> We have had the following hardware for about a year now
> and continue to have mysterious and intermittent network problems on it.

> Configuration: Tyan 2466 motherboard, built-in NIC: 3C905TX., dual
> AMD MP2000+ processors, 760MPX chipset. 1 GB RAM.  Typical mode of
> usage is fast burst transfer of large files, 1-2 GB.  Both NIC
> and switch are configured to auto-negotiate.

> 01:05.0 VGA compatible controller: ATI Technologies Inc Rage Mobility P/M AGP 2x (rev 64)

Are you running the video controller in graphics mode or text mode?
(Yes, it _might_ make a difference.  Some video controllers are notorious
for hogging bus bandwidth and violating bus-hold-time specs.)

> 02:08.0 Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado] (rev 78)

Important detail: no SCSI controllers.

> dmesg shows:
> 02:08.0: 3Com PCI 3c905C Tornado at 0x3000. Vers LK1.1.18-ac
>  00:e0:81:23:66:5b, IRQ 19
>   product code 0000 rev 00.6 date 00-00-00

Ehhhh, this makes me suspicious that other vital parameters from the
EEPROM are bogus.  Specifically the settings that tell the BIOS the PCI
burst bandwidth requirements.

> We have seen three different and repeating problems.
> 
> One: eth0: Too much work in interrupt, status e401.
> This has previously been explained to me as not being the fault of the
> scyld driver but of the kernel interrupt-handling scheme.

Typically the cause is other device drivers in the kernel.  You can reduce or turn
off this message, but that doesn't "fix" it.  The message is reporting a
real problem.

> Two:
> Oct  5 23:53:33 fnd0139 kernel: eth0: Setting half-duplex based on MII #24 link partner capability of 0000.
> Oct  5 23:54:33 fnd0139 kernel: eth0: Setting full-duplex based on MII #24 link partner capability of 41e1.
> 
> These errors are associated with increases in the count of
> transmit carrier errors in /proc/net/dev.  They do, for about a minute,
> lead to loss of network connectivity with the node.

Yup, they would.  They seem to be reporting a loss of link beat.
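
For reference, the two "link partner capability" words in those log lines
can be decoded by hand.  The sketch below assumes the standard IEEE 802.3
auto-negotiation base page layout for MII register 5.  The duplex rule is
a simplified illustration, not the driver's actual code, and the function
names are made up for this example.

```python
# Bit names for the MII link partner ability word (register 5),
# assuming the standard IEEE 802.3 auto-negotiation base page layout.
ABILITY_BITS = [
    (0x8000, "next-page"),
    (0x4000, "ack"),
    (0x2000, "remote-fault"),
    (0x0400, "pause"),
    (0x0200, "100baseT4"),
    (0x0100, "100baseTx-FD"),
    (0x0080, "100baseTx"),
    (0x0040, "10baseT-FD"),
    (0x0020, "10baseT"),
]


def decode_partner(word):
    """Return the names of the ability bits set in the partner word."""
    return [name for mask, name in ABILITY_BITS if word & mask]


def full_duplex(word):
    """Simplified rule (an assumption for illustration): full duplex if the
    partner offers 100baseTx-FD, or offers 10baseT-FD without 100baseTx."""
    return bool(word & 0x0100) or (word & 0x00C0) == 0x0040


for word in (0x0000, 0x41E1):
    print("%04x" % word, decode_partner(word),
          "full" if full_duplex(word) else "half")
```

A word of 0000 has no ability bits set at all, which is what you would
expect to read when there is no link partner, while 41e1 is a normal
10/100 partner that has acknowledged the negotiation.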

> necessarily restarted and it does cause trouble.  The link light
> of the NIC goes out during this time.

That's a vital detail.
The link LEDs going out strongly indicates that it's not the driver
having a problem reading the status register, but rather the switch
dropping the link.

This can be verified by running 'mii-diag' while the link is down.
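
Since the outage only lasts about a minute, it may help to have something
notice it for you.  A minimal sketch, assuming the usual /proc/net/dev
layout (sixteen counters after the interface name, with transmit carrier
errors as the fifteenth): poll the counter and run the diagnostic when it
moves.  The function name and the sample line are invented for this
example.

```python
def tx_carrier_errors(text, iface):
    """Return the transmit carrier error count for iface from the text of
    /proc/net/dev, or None if the interface is not listed."""
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep or name.strip() != iface:
            continue  # header lines have no colon; skip other interfaces
        fields = rest.split()
        # Assumed layout: 8 receive counters, then 8 transmit counters:
        # bytes packets errs drop fifo colls carrier compressed
        return int(fields[14])
    return None


# Made-up sample line, only to show the field position.
sample = "  eth0:1000 10 0 0 0 0 0 0 2000 20 0 0 0 0 7 0\n"
print(tx_carrier_errors(sample, "eth0"))
```

On a live node you would pass it the contents of /proc/net/dev every few
seconds and compare against the previous value.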

>  We have seen this problem
> happen on all of 240 different nodes in a cluster,

Simultaneously, or a random pattern?

> The problem
> happens with all of them.  Interestingly enough, the switch does
> not record any errors in its error counters during these episodes.

Hmmm, it definitely should record a re-negotiation.

> may be causing this problem..in particular is there any
> record that Tyan may have misconfigured this NIC when they put it
> on their board?

The bogus EEPROM sections hint that they didn't take any special care
with the implementation.

> Errors of this type:
> nfs: server d0bbin-farm not responding, still trying
> eth0: transmit timed out, tx_status 00 status e601.
>   diagnostics: net 0ccc media 8880 dma 0000003a.
> eth0: Interrupt posted but not delivered -- IRQ blocked by another device?

This is a very different case.
It is likely a problem with the kernel's IRQ handling.
The device driver is reporting that the hardware raised an interrupt,
yet the interrupt handler was never called.

> On the machine that gave this error:
> [root@fnd0228 bin]# ./mii-diag

This isn't a transceiver problem.  Check /proc/interrupts to see if the
IRQ count increases.
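
A rough sketch of that check, assuming the usual /proc/interrupts layout
(an IRQ label, a colon, one count per CPU, then the controller and device
names).  The helper names and the sample text are invented for this
example.  IRQ 19 is the one your dmesg output reports for the NIC.

```python
import time


def irq_count(text, irq):
    """Sum the per-CPU counts for one IRQ in the text of /proc/interrupts.
    Returns None if the IRQ is not listed."""
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep or label.strip() != str(irq):
            continue  # the CPU header line has no colon
        total = 0
        for field in rest.split():
            if not field.isdigit():
                break  # first non-numeric field starts the controller name
            total += int(field)
        return total
    return None


def irq_delta(irq, interval=5.0, path="/proc/interrupts"):
    """Read the count twice, interval seconds apart, and return the change."""
    with open(path) as f:
        before = irq_count(f.read(), irq)
    time.sleep(interval)
    with open(path) as f:
        after = irq_count(f.read(), irq)
    if before is None or after is None:
        return None
    return after - before


# Made-up sample text, only to show the parsing.
sample = ("           CPU0       CPU1\n"
          " 19:      12345      67890   IO-APIC-level  eth0\n"
          "NMI:          0          0\n")
print(irq_count(sample, 19))
```

If irq_delta(19) stays at zero while the interface is passing traffic, the
interrupts are being raised by the card but never reaching the handler.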

A work-around for the underlying bug might be to pass 'noapic' as a
kernel option.

> [root@fnd0228 bin]# ./vortex-diag -a
> vortex-diag.c:v2.14 12/28/2002 Donald Becker (becker@scyld.com)
...
>    Tx FIFO thresholds: min. burst 256 bytes, priority with 128 bytes to empty.
>    Rx FIFO thresholds: min. burst 256 bytes, priority with 128 bytes to full.
...
>    Maximum burst recorded Tx 0,  Rx 352.

Ehhh, that's a rather short burst.  The chip is capable of transferring
a whole 1500 byte packet to memory in a single burst.  On PCI 2.1, the
bus parameters should be set up to allow this.
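
To put numbers on that, here is a back-of-the-envelope calculation.  It
assumes a 32-bit, 33 MHz PCI bus with one data phase per clock and ignores
the address phase, wait states and arbitration, so treat the result as a
lower bound rather than a measurement.  The function names are made up for
this example.

```python
def data_phases(nbytes, bus_width_bytes=4):
    """Number of PCI data phases needed to move nbytes (rounded up)."""
    return -(-nbytes // bus_width_bytes)


def burst_time_us(nbytes, clock_mhz=33.0, bus_width_bytes=4):
    """Idealized burst duration in microseconds, one data phase per clock."""
    return data_phases(nbytes, bus_width_bytes) / clock_mhz


for nbytes in (352, 1500):
    print(nbytes, "bytes:", data_phases(nbytes), "data phases,",
          round(burst_time_us(nbytes), 2), "us")
```

Under those assumptions the recorded 352 byte maximum is 88 data phases,
against 375 for a full 1500 byte packet.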

> [root@fnd0228 bin]# ./vortex-diag -e

What does '-eee' (show all details) report?

> Saved EEPROM settings of a 3Com Vortex/Boomerang:
>  3Com Node Address 00:E0:81:23:66:5B (used as a unique ID only).
>  OEM Station address 00:E0:81:23:66:5B (used as the ethernet address).
>   Device ID 9200,  Manufacturer ID 6d50.
>   Manufacture date (MM/DD/YYYY) 0/0/2000, division , product .
>   No BIOS ROM is present.
>  Transceiver selection: Autonegotiate.
>    Options: negotiated duplex, link beat required.
>  PCI Subsystem IDs: Vendor 10f1 Device 2466.
>  100baseTx 10baseT.
>   Vortex format checksum is incorrect (29 vs. 10f1).
>   Cyclone format checksum is incorrect (0xb5 vs. 0xff).
>   Hurricane format checksum is incorrect (0x68 vs. 0xff).

They didn't get the checksum right.

> to operate without errors.  Replacing the motherboard also fixes
> this problem.  Is there something short of either of these
> two solutions which could be done to stop this from happening?
..
> We have actually thought about buying different network cards
> to put into these machines, but are reluctant to do so because
> we believe this would interfere with being able to PXE-boot
> these machines.

If you decide to do this, buy network adapters with their own PXE boot
ROMs.  The motherboard BIOS PXE ROM will not work, even with other
3c905c cards.

-- 
Donald Becker				becker@scyld.com
Scyld Computing Corporation		http://www.scyld.com
914 Bay Ridge Road, Suite 220		Scyld Beowulf cluster system
Annapolis MD 21403			410-990-9993