[eepro100] eepro100 82559 problems

Nate Amsden subscriptions@graphon.com
Thu, 15 Feb 2001 13:27:56 -0800


Hi,

After seeing this message posted by Antwerpen@netsquare.org:
http://www.scyld.com/pipermail/eepro100/2001-February/001509.html

I figured I should post, because I have a very similar problem.

We have 3 identical 1U systems running Supermicro S370SSE motherboards
(at least I'm 99.99999% sure they are; I can't be 100% sure without taking
a system apart). They have dual onboard Intel 82559 NICs.

(Somewhat related:)
When running OpenBSD 2.8 on one of them, the machine seemed to crash
after about 5 minutes of use (firewalling/port forwarding under
very low load, maybe 10kb/s at best).

I have since replaced OpenBSD 2.8 with Debian GNU/Linux 2.2r2 and
kernel 2.2.17 plus many patches, including modules for eepro100 v1.11a.
This machine has been operating perfectly for the past 68 days
20 hours. At another location on the other side of the country
we are trying to deploy the 2nd of the 3 systems with a similar
configuration (kernel and modules are identical, BIOS settings
match, etc.). Since we deployed it (on Monday, I think) it has
consistently locked up hard every night. Today we synced the BIOS
settings between the unit here and the one there, and things seemed
to be going better, but the errors are still showing up in the
logs. They have never shown up in the logs of the unit here.

sample log entry:

Feb 10 05:37:11 gate-nh kernel: eth1: Transmit timed out: status 0050  0080 at
59/61 commands 000c0000 400c0000 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1: Tx ring dump,  Tx queue 61 / 59:
Feb 10 05:37:11 gate-nh kernel: eth1:   0 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   1 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   2 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   3 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   4 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   5 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   6 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   7 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   8 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   9 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   10 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   11 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   12 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   13 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   14 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   15 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   16 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   17 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   18 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   19 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   20 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   21 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   22 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   23 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   24 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   25 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   26 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1: * 27 000c0000.
Feb 10 05:37:11 gate-nh kernel: eth1:   28 400c0000.
Feb 10 05:37:11 gate-nh kernel: eth1:  =29 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   30 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   31 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:Printing Rx ring (next to receive into
143).
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 0  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 1  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 2  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 3  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 4  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 5  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 6  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 7  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 8  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 9  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 10  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 11  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 12  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 13  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 14  c0000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 15  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 16  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 17  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 18  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 19  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 20  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 21  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 22  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 23  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 24  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 25  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 26  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 27  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 28  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 29  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 30  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 31  00000001.
Feb 10 05:37:11 gate-nh kernel:   PHY index 1 register 0 is 3100.
Feb 10 05:37:11 gate-nh kernel:   PHY index 1 register 1 is 782d.
Feb 10 05:37:11 gate-nh kernel:   PHY index 1 register 2 is 02a8.
Feb 10 05:37:11 gate-nh kernel:   PHY index 1 register 3 is 0320.
Feb 10 05:37:11 gate-nh kernel:   PHY index 1 register 4 is 05e1.
Feb 10 05:37:11 gate-nh kernel:   PHY index 1 register 5 is 0021.
Feb 10 05:37:11 gate-nh kernel:   PHY index 1 register 21 is 0000.
Feb 10 05:37:11 gate-nh kernel: eth1: Tx ring dump,  Tx queue 61 / 59:
Feb 10 05:37:11 gate-nh kernel: eth1:   0 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   1 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   2 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   3 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   4 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   5 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   6 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   7 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   8 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   9 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   10 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   11 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   12 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   13 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   14 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   15 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   16 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   17 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   18 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   19 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   20 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   21 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   22 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   23 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   24 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   25 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   26 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1: * 27 000c0000.
Feb 10 05:37:11 gate-nh kernel: eth1:   28 400c0000.
Feb 10 05:37:11 gate-nh kernel: eth1:  =29 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   30 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:   31 000ca000.
Feb 10 05:37:11 gate-nh kernel: eth1:Printing Rx ring (next to receive into 143)
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 0  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 1  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 2  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 3  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 4  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 5  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 6  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 7  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 8  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 9  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 10  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 11  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 12  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 13  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 14  c0000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 15  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 16  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 17  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 18  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 19  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 20  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 21  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 22  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 23  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 24  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 25  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 26  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 27  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 28  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 29  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 30  00000001.
Feb 10 05:37:11 gate-nh kernel:   Rx ring entry 31  00000001.
Feb 10 05:37:11 gate-nh kernel:   PHY index 1 register 0 is 3100.
Feb 10 05:37:11 gate-nh kernel:   PHY index 1 register 1 is 782d.
Feb 10 05:37:11 gate-nh kernel:   PHY index 1 register 2 is 02a8.
Feb 10 05:37:11 gate-nh kernel:   PHY index 1 register 3 is 0320.
Feb 10 05:37:11 gate-nh kernel:   PHY index 1 register 4 is 05e1.
Feb 10 05:37:11 gate-nh kernel:   PHY index 1 register 5 is 0021.
Feb 10 05:37:11 gate-nh kernel:   PHY index 1 register 21 is 0000.
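
As a rough aid for reading the PHY registers in the dump above, here is a
minimal sketch that decodes them. It assumes the standard MII register
layout (register 0 = control, 1 = status, 4 = local advertisement,
5 = link partner ability); the bit names and helper functions are
illustrative and are not taken from the driver output.

```python
# Bit meanings follow the standard MII management register layout.
# This is an assumption about how to read the dump, not driver output.
BMCR_BITS = {0x2000: "speed-100", 0x1000: "autoneg-enable", 0x0100: "full-duplex"}
BMSR_BITS = {0x0020: "autoneg-complete", 0x0004: "link-up"}

# Link abilities, highest autonegotiation priority first.
ABILITY_BITS = [
    (0x0100, "100baseTx-FD"),
    (0x0200, "100baseT4"),
    (0x0080, "100baseTx-HD"),
    (0x0040, "10baseT-FD"),
    (0x0020, "10baseT-HD"),
]


def flags(value, table):
    """Return the names of the bits set in value, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True) if value & bit]


def abilities(value):
    """Return the link abilities encoded in an advertisement register."""
    return [name for bit, name in ABILITY_BITS if value & bit]


def negotiated(advertised, partner):
    """Return the best mode both ends advertise, or None if there is none."""
    common = abilities(advertised & partner)
    return common[0] if common else None


# Register values copied from the eth1 dump in the log.
regs = {0: 0x3100, 1: 0x782D, 4: 0x05E1, 5: 0x0021}

print("control:   ", flags(regs[0], BMCR_BITS))
print("status:    ", flags(regs[1], BMSR_BITS))
print("advertised:", abilities(regs[4]))
print("partner:   ", abilities(regs[5]))
print("common:    ", negotiated(regs[4], regs[5]))
```

Read this way, the link partner register (0021) would advertise only
10baseT half duplex, which would fit the collisions counted on eth1.
That is an interpretation under the assumptions above, not something
the driver itself reports.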


I'm not sure what kind of machine was replaced by this one, but I
could find out. It was a Red Hat machine and it ran for about the
past year until we decided to replace it with a racked Debian
box. Any idea what could cause this? It only started happening once
we began using the new system. I bet the OpenBSD crashes
on my end here were the result of something similar; however,
OpenBSD didn't give any errors, it just dropped to the
debugger and sat there until I rebooted it. Buggy chip?
Buggy driver? It's hard to imagine the driver is to blame, as
the other system has been running for over 2 months without
a single problem.
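
For anyone reading the Tx ring dump in the sample log: the two numbers
in "Tx queue 61 / 59" appear to be free-running counters, and the "="
and "*" markers land on those counters taken modulo the ring size. The
sketch below only checks that arithmetic. The ring size of 32 is
inferred from the dump listing entries 0 to 31, and the reading of the
first counter as "next to queue" and the second as "oldest not yet
completed" is an assumption.

```python
# Assumption: the ring has 32 slots, since the dump lists entries 0..31.
TX_RING_SIZE = 32


def ring_slot(counter, ring_size=TX_RING_SIZE):
    """Map a free-running counter onto a slot index in the ring."""
    return counter % ring_size


# Counters from the log line "Tx queue 61 / 59".
cur_tx, dirty_tx = 61, 59

print("slot for first counter: ", ring_slot(cur_tx))    # '=' marker in the dump
print("slot for second counter:", ring_slot(dirty_tx))  # '*' marker in the dump
print("commands outstanding:   ", cur_tx - dirty_tx)
```

Under that reading only two commands were outstanding when the timeout
fired, so the Tx ring was far from full.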

running ifconfig on both systems shows:
(on broken system)
  4:27pm  up  5:48,  1 user,  load average: 0.00, 0.00, 0.00
eth0      Link encap:Ethernet  HWaddr 00:30:48:11:02:D8  
          inet addr:192.168.100.2  Bcast:192.168.100.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:417948 errors:0 dropped:0 overruns:0 frame:0
          TX packets:184575 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          Interrupt:11 Base address:0xb000 

eth1      Link encap:Ethernet  HWaddr 00:30:48:11:12:16  
          inet addr:XX.XX.XX.XX  Bcast:XX.255.255.255  Mask:255.255.255.XXX
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:207215 errors:0 dropped:0 overruns:0 frame:0
          TX packets:192204 errors:2 dropped:0 overruns:0 carrier:0
          collisions:157 txqueuelen:100 
          Interrupt:5 Base address:0xd000 

eth1:0    Link encap:Ethernet  HWaddr 00:30:48:11:12:16  
          inet addr:XX.XX.XX.XXX  Bcast:XX.255.255.255  Mask:255.255.255.XXX
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:5 Base address:0xd000 

(on working system)
  1:24pm  up 68 days, 20:42,  1 user,  load average: 0.00, 0.04, 0.06
eth0      Link encap:Ethernet  HWaddr 00:30:48:11:02:D9  
          inet addr:XX.XX.XX.XX  Bcast:XX.XX.XXX.XXX  Mask:255.255.255.XXX
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:101475510 errors:0 dropped:0 overruns:0 frame:1
          TX packets:117209873 errors:0 dropped:0 overruns:0 carrier:116
          collisions:15698167 txqueuelen:100 
          Interrupt:11 Base address:0x9000 

eth0:1    Link encap:Ethernet  HWaddr 00:30:48:11:02:D9  
          inet addr:XX.XX.XX.XXX  Bcast:XX.XX.XX.255  Mask:255.255.255.XXX
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:11 Base address:0x9000 

eth1      Link encap:Ethernet  HWaddr 00:30:48:11:12:17  
          inet addr:192.168.50.20  Bcast:192.168.50.255  Mask:255.255.255.XX
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:123049598 errors:0 dropped:0 overruns:0 frame:0
          TX packets:104000322 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          Interrupt:5 Base address:0xb000 

I point this out because eth1 on the broken system has 2 TX errors,
while after all the weeks and packets that have gone through the working
system it still reports errors:0 on both interfaces. It does show a lot
of collisions, but it is hooked up to a $10 hub...
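
To put the collision counts in proportion, here is a small sketch that
divides collisions by transmitted packets, using the counters from the
ifconfig output above. The helper name and the choice of TX packets as
the denominator are illustrative assumptions.

```python
def collision_rate(collisions, tx_packets):
    """Collisions per transmitted packet, as a fraction."""
    return collisions / tx_packets


# (collisions, TX packets) copied from the ifconfig output.
counters = {
    "broken eth1": (157, 192204),
    "working eth0": (15698167, 117209873),
}

for name, (collisions, tx_packets) in counters.items():
    print(f"{name}: {collision_rate(collisions, tx_packets):.2%}")
```

By this measure the working system's hub-attached eth0 sees far more
collisions per packet than the broken system's eth1, yet only the
broken system logs TX errors.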

here is the kernel log for the broken system when the kernel loaded
the driver:

Feb 15 02:38:45 gate-nh kernel: eepro100.c:v1.11a 7/31/2000 Donald Becker
<becker@scyld.com>
Feb 15 02:38:45 gate-nh kernel:   http://www.scyld.com/network/eepro100.html
Feb 15 02:38:45 gate-nh kernel: eth0: OEM i82557/i82558 10/100 Ethernet at
0xc808b000, 00:30:48:11:02:D8, IRQ 11.
Feb 15 02:38:45 gate-nh kernel:   Receiver lock-up bug exists -- enabling
work-around.
Feb 15 02:38:45 gate-nh kernel:   Board assembly 000000-000, Physical connectors
present: RJ45
Feb 15 02:38:45 gate-nh kernel:   Primary interface chip i82555 PHY #1.
Feb 15 02:38:45 gate-nh kernel:   General self-test: passed.
Feb 15 02:38:45 gate-nh kernel:   Serial sub-system self-test: passed.
Feb 15 02:38:45 gate-nh kernel:   Internal registers self-test: passed.
Feb 15 02:38:45 gate-nh kernel:   ROM checksum self-test: passed (0x04f4518b).
Feb 15 02:38:45 gate-nh kernel: eth1: OEM i82557/i82558 10/100 Ethernet at
0xc808d000, 00:30:48:11:12:16, IRQ 5.
Feb 15 02:38:45 gate-nh kernel:   Receiver lock-up bug exists -- enabling
work-around.
Feb 15 02:38:45 gate-nh kernel:   Board assembly a19716-001, Physical connectors
present: RJ45
Feb 15 02:38:45 gate-nh kernel:   Primary interface chip i82555 PHY #1.
Feb 15 02:38:45 gate-nh kernel:   General self-test: passed.
Feb 15 02:38:45 gate-nh kernel:   Serial sub-system self-test: passed.
Feb 15 02:38:45 gate-nh kernel:   Internal registers self-test: passed.
Feb 15 02:38:45 gate-nh kernel:   ROM checksum self-test: passed (0x04f4518b).

I imagine the output is similar for the working system, but its bootup
logs are cycled and overwritten after a month of uptime.

Network load on both systems is extremely light. MRTG reports average
network traffic over the past 5 weeks of 2.9kB/s both ways for the
broken system; the working one averages 13-14kB/s both ways over the
same period. Both systems are on 1Mbit DSL connections.

The 3rd is sitting on a shelf waiting for someone to get the time
to set it up. It's in another state, so I don't have access to it.

The machines themselves are single P3-733MHz with 128MB RAM, using
that Supermicro motherboard and a single 20GB Quantum IDE drive.

Any ideas would be appreciated :) I have a feeling it will
lock up again tonight.

thanks!

nate

-- 
Nate Amsden
System Administrator
GraphOn
http://www.graphon.com