[eepro100] eepro100 failures

Donald Becker becker@scyld.com
Fri, 24 Aug 2001 00:55:04 -0400 (EDT)

On Thu, 23 Aug 2001, Steinar Hauan wrote:

> I have a small cluster of dual-cpu P3 machines on RedHat 7.1++
> with network trouble using Intel Pro/100 adapters. Specifically,
> a diff on tcpdump for the Tx and Rx ends -- reproducible by both nfs
> and ftp -- shows that 1 bit is being flipped. The error is quite rare;
> only a few specific bit sequences located at specific offsets produce
> the error. A typical bit pattern is 5-12 bytes and cause errors if
> found at offset j*4+1 in a packet with j=1, 2, ... , N.
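
For anyone wanting to reproduce this kind of comparison, here is a minimal sketch of locating the flipped bits between the payload captured on the Tx side and the one captured on the Rx side. It assumes the two payloads have already been extracted from the tcpdump output as raw bytes of equal length; the function name flipped_bits is hypothetical and not part of any tool mentioned here.

```python
def flipped_bits(sent: bytes, received: bytes):
    """Return a list of (byte_offset, bit_index) for every differing bit.

    bit_index 0 is the least significant bit of the byte.
    Assumes both payloads were captured in full and have equal length.
    """
    if len(sent) != len(received):
        raise ValueError("payloads must have equal length")
    diffs = []
    for offset, (a, b) in enumerate(zip(sent, received)):
        delta = a ^ b  # set bits mark positions where the bytes differ
        for bit in range(8):
            if delta & (1 << bit):
                diffs.append((offset, bit))
    return diffs


if __name__ == "__main__":
    tx = bytes(range(16))
    rx = bytearray(tx)
    rx[5] ^= 0x10  # simulate one flipped bit in byte 5
    print(flipped_bits(tx, bytes(rx)))
```

A single entry in the result is consistent with the one-bit corruption described above; many entries would point to a different kind of failure.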

This is a very unusual error.

First I'll rule out the things that might appear to be causing the problem.
   Bit flips on the wire
     These would be caught by the Ethernet CRC.  And besides, with
     100baseTx bits are corrupted in groups of four, not one at a time.
     Ethernet errors are reported in /proc/net/dev and you'll likely see
     zillions of them before an undetected bit error slips through.  (The
     probability depends heavily on the noise type.)
   Bit flips inside the chip
   Bit flips on the bus
     These would both be caught by the TCP/IP checksum.  No single bit
     error will slip through.  Additionally the PCI bus has a parity check,
     which will catch single bit errors.
     Note that my drivers do not use the Rx checksum support in the
     eepro100 chips.  Recent drivers do show how to retrieve the partial
     checksum, but this is a good example of why it's questionable to use
     the feature.
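
To illustrate the claim that no single bit error slips past the TCP/IP checksum, here is a minimal sketch of the 16-bit ones' complement Internet checksum, plus a brute-force check that flips every bit of a sample payload in turn. This is an illustration only, not code from the kernel or from any driver; the function names and the sample payload are made up.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum of the kind used by IP, TCP and UDP."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def single_bit_flips_detected(data: bytes) -> bool:
    """True if every possible one-bit corruption changes the checksum."""
    reference = internet_checksum(data)
    for offset in range(len(data)):
        for bit in range(8):
            corrupted = bytearray(data)
            corrupted[offset] ^= 1 << bit
            if internet_checksum(bytes(corrupted)) == reference:
                return False
    return True


if __name__ == "__main__":
    payload = b"hypothetical sample payload"
    print(single_bit_flips_detected(payload))
```

A flipped bit changes one 16-bit word by a power of two, which always changes the ones' complement sum. The checksum only protects data that is corrupted after it was computed and before it is verified, which is why the corruption seen here must happen outside that window.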

So that leaves us with memory, kernel or processor errors after the TCP/IP
checksum is computed.
With only a single bit flipped, it's unlikely to be a wild write to
memory from some other part of the kernel.

> Now here is the main cause of concern. Yesterday, I went out to my
> local computer store and bought 4 new ethernet cards.
>   1x 3Com PCI 3c590 Vortex 10Mbps (10Tx-HD)
>   1x 3Com 3c905B Cyclone 100baseTx (100Tx-FD)
>   1x Intellinet 10/100 PCI network card
>   1x SMC 1244TX Rev B (100TX-FD)
>   the last two card use the RealTek RTL8139 chip (100Tx-FD).
> Whenever i boot my machines with one of the above cards along
> with the "noapic" kernel option, the errors go away.

My guess: the bit flips occur as memory corruption during bus master
writes from the PCI bus.  The eepro100 chip is likely a '559 which is
PCI v2.1 and can generate very long PCI bursts.  The 3c590, 905B and
rtl8139 will not generate packet-sized bursts.  Or perhaps the eepro100
has just the wrong PCI timing that triggers memory corruption.

[[ BTW, where did you find an _ancient_ 3c590?!  Under a pyramid? ]]

> Could this be a
> driver error? What else could it be? Why does "noapic" make a difference
> with IRQs locked to specific pci slots? (and nothing to share with)

Different topic: "noapic" disables the extra APIC features and thus avoids
a long-standing Linux kernel bug.

Donald Becker				becker@scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993