[eepro100] Transmit errors with i8255X cards using eepro100 driver on compaq alpha

Jim Matthews jmatthew@tabdemo.larc.nasa.gov
Fri, 20 Apr 2001 11:36:59 -0400 (EDT)


I am in the process of setting up an Alpha Linux (Beowulf) cluster but
am running into network problems with two i8255X cards.  Originally
I had set up these cards with channel bonding, but I have disabled it so
I could debug the problems more easily.  Right now each machine in the
cluster has 3 cards: one is a DEC 21143 (using the de4x5 driver) and the
other 2 are i8255X cards (using the EtherExpress eepro100 driver), also
from Compaq.  Each of the cards is on a different subnet.
To test throughput robustness I have sent large amounts of data (5
cats of /dev/zero) to other cluster nodes.  I have found that if I
saturate only one of the i8255X cards, data transfers without error.  The
moment I start to send data (e.g. initiate a cat of /dev/zero) over one
of the other interfaces, either the 21143 or the other i8255X, I begin
to get transmit errors on one of the i8255X cards.  If I send data over
both i8255X cards I get transmit errors on both i8255X interfaces,
but I never see transmit errors on the 21143 interface.  Transmit errors
are reported on all nodes that are being sent the data.
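
A rough Python equivalent of the /dev/zero load test, as a sketch only.
The post does not say how the cat output was carried to the other nodes,
so the TCP transport and the names send_zeros, sink and loopback_demo are
assumptions made for illustration.

```python
import socket
import threading

CHUNK = 64 * 1024  # bytes handed to the socket per sendall() call


def send_zeros(host, port, total_bytes):
    """Stream total_bytes zero bytes to host:port over TCP; return bytes sent."""
    buf = bytes(CHUNK)  # a block of zeros, like reading /dev/zero
    sent = 0
    with socket.create_connection((host, port)) as s:
        while sent < total_bytes:
            n = min(CHUNK, total_bytes - sent)
            s.sendall(buf[:n])
            sent += n
    return sent


def sink(listener, result):
    """Accept one connection on listener and count bytes until the peer closes."""
    conn, _ = listener.accept()
    total = 0
    with conn:
        while True:
            data = conn.recv(CHUNK)
            if not data:
                break
            total += len(data)
    result.append(total)


def loopback_demo(total_bytes=1048576):
    """Run sender and sink on 127.0.0.1 and return (sent, received)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    result = []
    t = threading.Thread(target=sink, args=(listener, result))
    t.start()
    sent = send_zeros("127.0.0.1", port, total_bytes)
    t.join()
    listener.close()
    return sent, result[0]
```

On the cluster, several senders would be started at once, each aimed at
a node address on the subnet of the card under test.  Because each card
is on a different subnet, the destination address decides which
interface carries the traffic.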

The three interface cards are connected to 2 Cisco 3500 switches.  One
of the switches is segmented into 2 VLANs to isolate traffic between
interfaces.  When I observe a transmit error in Linux, I also see the
switch try to renegotiate the link for the interface that reported the
error.  The switch renegotiation made me wonder about a hardware problem
or a switch configuration problem, but since the 21143 interface works
without error when connected to either switch, I am assuming it is a
driver issue.

Another message I am seeing in the i8255X debug output (included at the
end) is the "Tx ring dump".  I notice that "TX_RING_SIZE" is set in the
eepro100.c source code.  I was wondering if the setting for this value
might affect the problem?

The alphas have the following configuration:
Compaq Alpha XP1000 21264 667 MHz
1.2 GB RAM
1 21143 card
2 i8255X cards
Red Hat Linux v7.0
Kernel 2.4.2 (includes latest eepro100 driver, v1.36)

Do you have any idea what would be causing this problem?
Help is greatly appreciated.
Thanks,

--Jim Matthews
--System Administrator
--Raytheon Information Services
--NASA Langley Research Center

Additional info follows:

The following is the detection of the two i82555 cards by the 2.4.2
kernel's eepro100 driver:

Apr 19 15:56:50 cfdalc2n1 kernel: eepro100.c:v1.09j-t 9/29/99 Donald Becker http://cesdis.gsfc.nasa.gov/linux/drivers/eepro100.html
Apr 19 15:56:50 cfdalc2n1 kernel: eepro100.c: $Revision: 1.36 $ 2000/11/17 Modified by Andrey V. Savochkin <saw@saw.sw.com.sg> and others
Apr 19 15:56:50 cfdalc2n1 kernel: eth1: OEM i82557/i82558 10/100 Ethernet, 00:50:8B:B4:A2:5C, IRQ 36.
Apr 19 15:56:50 cfdalc2n1 kernel:   Board assembly 726837-017, Physical connectors present: RJ45
Apr 19 15:56:50 cfdalc2n1 kernel:   Primary interface chip i82555 PHY #1.
Apr 19 15:56:50 cfdalc2n1 kernel:   General self-test: passed.
Apr 19 15:56:50 cfdalc2n1 kernel:   Serial sub-system self-test: passed.
Apr 19 15:56:50 cfdalc2n1 kernel:   Internal registers self-test: passed.
Apr 19 15:56:50 cfdalc2n1 kernel:   ROM checksum self-test: passed (0x04f4518b).
Apr 19 15:56:50 cfdalc2n1 kernel: eth2: OEM i82557/i82558 10/100 Ethernet, 00:50:8B:B4:48:DA, IRQ 32.
Apr 19 15:56:50 cfdalc2n1 kernel:   Board assembly 726837-017, Physical connectors present: RJ45
Apr 19 15:56:50 cfdalc2n1 kernel:   Primary interface chip i82555 PHY #1.
Apr 19 15:56:50 cfdalc2n1 kernel:   General self-test: passed.
Apr 19 15:56:50 cfdalc2n1 kernel:   Serial sub-system self-test: passed.
Apr 19 15:56:50 cfdalc2n1 kernel:   Internal registers self-test: passed.
Apr 19 15:56:50 cfdalc2n1 kernel:   ROM checksum self-test: passed (0x04f4518b).


These are the syslog messages I am seeing for the transmit timeouts:

Apr 19 16:55:25 cfdalc2n1 kernel: NETDEV WATCHDOG: eth2: transmit timed out
Apr 19 16:55:25 cfdalc2n1 kernel: eth2: Transmit timed out: status 0050  0c00 at 10218925/10218953 command 000c0000.
Apr 19 16:56:27 cfdalc2n1 kernel: NETDEV WATCHDOG: eth3: transmit timed out
Apr 19 16:56:27 cfdalc2n1 kernel: eth3: Transmit timed out: status 0050  0c00 at 9000445/9000473 command 000c0000.
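
To compare timeout events across interfaces, the "Transmit timed out"
lines can be pulled apart with a short Python sketch.  The meaning of
the two numbers after "at" is not stated in this post; the code only
assumes they are a pair of running counters and reports their
difference.  The field names (word1, word2, first, second, gap) are made
up for illustration.

```python
import re

# Sample lines copied from the syslog excerpt above (wrapped lines re-joined).
LOG = """\
Apr 19 16:55:25 cfdalc2n1 kernel: eth2: Transmit timed out: status 0050  0c00 at 10218925/10218953 command 000c0000.
Apr 19 16:56:27 cfdalc2n1 kernel: eth3: Transmit timed out: status 0050  0c00 at 9000445/9000473 command 000c0000.
"""

# \s+ between fields tolerates both single/double spaces and line wraps.
PATTERN = re.compile(
    r"(eth\d+): Transmit timed out: status\s+([0-9a-f]{4})\s+([0-9a-f]{4})"
    r"\s+at\s+(\d+)/(\d+)\s+command\s+([0-9a-f]{8})"
)


def parse_timeouts(text):
    """Return one dict per 'Transmit timed out' line found in text."""
    events = []
    for m in PATTERN.finditer(text):
        first, second = int(m.group(4)), int(m.group(5))
        events.append({
            "iface": m.group(1),
            "word1": m.group(2),      # first hex word after "status"
            "word2": m.group(3),      # second hex word after "status"
            "first": first,           # number before the slash
            "second": second,         # number after the slash
            "gap": second - first,    # assumed: distance between two counters
            "command": m.group(6),
        })
    return events


if __name__ == "__main__":
    for ev in parse_timeouts(LOG):
        print(ev["iface"], ev["word1"], ev["word2"], ev["gap"], ev["command"])
```

For the two lines quoted above, the gap comes out as 28 on both eth2 and
eth3, with identical status and command words.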


This is a longer "debug" version of the above transmit errors for one
card:

NETDEV WATCHDOG: eth2: transmit timed out
eth2: Transmit timed out: status 0050  0c00 at 10201183/10201211 command 000c0000.
eth2: Tx ring dump,  Tx queue 10201211 / 10201183:
eth2:     0 200ca000.
eth2:     1 000ca000.
eth2:     2 000ca000.
eth2:     3 000ca000.
eth2:     4 000ca000.
eth2:     5 000ca000.
eth2:     6 000ca000.
eth2:     7 000ca000.
eth2:     8 200ca000.
eth2:     9 000ca000.
eth2:    10 000ca000.
eth2:    11 000ca000.
eth2:    12 000ca000.
eth2:    13 000ca000.
eth2:    14 000ca000.
eth2:    15 000ca000.
eth2:    16 200ca000.
eth2:    17 000ca000.
eth2:    18 000ca000.
eth2:    19 000ca000.
eth2:    20 000ca000.
eth2:    21 000ca000.
eth2:    22 000ca000.
eth2:    23 000ca000.
eth2:    24 200ca000.
eth2:    25 000ca000.
eth2:    26 400ca000.
eth2:   =27 000ca000.
eth2:    28 000ca000.
eth2:    29 000ca000.
eth2:    30 000ca000.
eth2:  * 31 000c0000.
eth2: Printing Rx ring (next to receive into 5354158, dirty index 5354158).
eth2:     0 00000001.
eth2:     1 00000001.
eth2:     2 00000001.
eth2:     3 00000001.
eth2:     4 00000001.
eth2:     5 00000001.
eth2:     6 00000001.
eth2:     7 00000001.
eth2:     8 00000001.
eth2:     9 00000001.
eth2:    10 00000001.
eth2:    11 00000001.
eth2:    12 00000001.
eth2: l  13 c0000001.
eth2:  *=14 00000001.
eth2:    15 00000001.
eth2:    16 00000001.
eth2:    17 00000001.
eth2:    18 00000001.
eth2:    19 00000001.
eth2:    20 00000001.
eth2:    21 00000001.
eth2:    22 00000001.
eth2:    23 00000001.
eth2:    24 00000001.
eth2:    25 00000001.
eth2:    26 00000001.
eth2:    27 00000001.
eth2:    28 00000001.
eth2:    29 00000001.
eth2:    30 00000001.
eth2:    31 00000001.
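
As a cross-check of the ring dumps above, the following Python sketch
maps the large counters in the dump headers onto ring slots.  It rests
on two assumptions that this post does not confirm: that the counters
are free-running and are reduced modulo the ring size to pick a slot,
and that the ring size is 32 because the dumps list entries 0 to 31.
The helper names are hypothetical.

```python
import re

RING_SIZE = 32  # assumption: the dumps above list entries 0..31

# Header and flagged entry lines copied from the debug output above.
TX_HEADER = "eth2: Tx ring dump,  Tx queue 10201211 / 10201183:"
TX_MARKED = ["eth2:   =27 000ca000.", "eth2:  * 31 000c0000."]
RX_HEADER = ("eth2: Printing Rx ring (next to receive into 5354158, "
             "dirty index 5354158).")
RX_MARKED = ["eth2:  *=14 00000001."]

# Optional marker characters, then the slot number, then the 8-digit hex word.
ENTRY = re.compile(r"^eth\d+:\s*([^0-9\s]*)\s*(\d+)\s+[0-9a-f]{8}\.$")


def counters_from_header(header):
    """Return the standalone decimal numbers in a dump header line."""
    return [int(n) for n in re.findall(r"\b\d+\b", header)]


def slot(counter, ring_size=RING_SIZE):
    """Map a free-running counter onto a ring slot (assumed modulo mapping)."""
    return counter % ring_size


def marked_slots(lines):
    """Return {marker_char: slot_number} for entries flagged in the dump."""
    marks = {}
    for line in lines:
        m = ENTRY.match(line.strip())
        if m:
            for ch in m.group(1):
                marks[ch] = int(m.group(2))
    return marks


if __name__ == "__main__":
    for name, header, lines in (("Tx", TX_HEADER, TX_MARKED),
                                ("Rx", RX_HEADER, RX_MARKED)):
        counters = counters_from_header(header)
        print(name, [slot(c) for c in counters], marked_slots(lines))
```

Under those assumptions the Tx counters 10201211 and 10201183 land on
slots 27 and 31, which are the entries flagged "=" and "*" in the dump,
and the Rx counter 5354158 lands on slot 14, the entry flagged "*=".
The two Tx counters are 28 apart.  This only shows how the dump can be
read; it does not say whether changing TX_RING_SIZE would affect the
timeouts.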