[eepro100] Receiver lock-up -- is this right??

les Niles lniles@Narus.com
Wed, 8 Aug 2001 07:59:34 -0700


> From: Donald Becker [mailto:becker@scyld.com]
> On Tue, 7 Aug 2001, les Niles wrote:
> 
> > I'm trying to track down an eepro100 lock-up problem.  ...  
> 
> This test is wrong.
> 
> > The second, almost at the end of speedo_found1(), sets the rx_bug flag to
> > "(eeprom[3] & 0x03) == 3 ? 0 : 1" and prints another KERN_INFO message if
> > rx_bug is set.
> 
> This second test is correct, and it's the one that sets the work-around
> bit.
> 
> 0x0001 No need to run the work-around at 100Mbps 
> 0x0002 No need to run the work-around at 10Mbps 
> 
> Since the driver doesn't always know that it can find the speed
> detected, it always runs the work-around if either bit is set.  The
> work-around has minimal impact, so this is reasonable.

Thanks; that makes sense.  And it turns out the work-around doesn't 
solve my lockup problem -- I forced rx_bug to be set and that didn't 
stop the lockup.  
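
For reference, the second test boils down to something like the sketch 
below.  The function and macro names are made up for illustration and are 
not the driver's; only the expression and the two bit meanings come from 
the discussion above.

```c
#include <stdint.h>

/* Bit meanings for EEPROM word 3 as described above.
 * The macro names are hypothetical, not taken from the driver. */
#define RX_FIX_NOT_NEEDED_100  0x0001  /* no work-around needed at 100Mbps */
#define RX_FIX_NOT_NEEDED_10   0x0002  /* no work-around needed at 10Mbps */

/* Returns 1 when the work-around should run and 0 when both bits say it
 * is unnecessary.  Equivalent to "(eeprom[3] & 0x03) == 3 ? 0 : 1". */
static int rx_bug_from_eeprom(const uint16_t *eeprom)
{
    const uint16_t both = RX_FIX_NOT_NEEDED_100 | RX_FIX_NOT_NEEDED_10;

    return (eeprom[3] & both) == both ? 0 : 1;
}
```

In the EEPROM dump further down, word 3 is 0x0503, so both bits are set 
and this evaluates to 0, which is consistent with my having to force 
rx_bug on by hand.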

So let me ask for other ideas.  Here's the setup: A Dell PowerEdge 1550 
with a pair of on-board eepro100s, running RedHat 7.0, SMP kernel, and 
v1.15 of the eepro100 driver.  (We've previously seen the same symptoms 
with different motherboards and with different kernel and driver 
versions, so I'm guessing it's not specific to these.)  Originally sleep 
mode was enabled on both interfaces; I turned it off with no effect on 
the lockup problem.

Data gets blasted at the interface in question, at 4-10 MB/sec, in the 
form of 1500 byte UDP packets.  Not much is being sent out that interface, 
just a few packets due to an open telnet session.  After a few minutes -- 
on the order of a million to ten million packets received (higher data 
rates cause the problem to appear much sooner) -- the receiver stops 
receiving.  All the RX counters stop incrementing, with RX-ERR at 3 and 
RX-OVR at 7.  (Before the lockup RX-ERR, RX-DRP and RX-OVR are all 0.)  
The TX counter still claims that occasional packets are being sent.  
Bouncing the interface (ifconfig down/ifconfig up) brings it back to life.  

There are two messages that get logged:
  eth1: Unknown receiver error, status=0x5048.
  eth1: Unknown receiver error, status=0x5148.
A post-lockup dump of eepro100-diag is attached.   

One curious thing that strikes me is the 0x0100 bit in the status.  Not 
having access to #$%! Intel's doc, I can't figure out what this is, but 
from its location and the code it looks suspiciously like an abnormal 
interrupt.  If so, I don't see any place in the driver where it 
would get acknowledged.
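
To make the comparison concrete, here is a small sketch that splits a 
status word into fields and XORs the two logged values.  The field 
layout (high byte for interrupt/status bits, bits 7-6 for the command 
unit state, bits 5-2 for the receive unit state) is an assumption on my 
part, not something taken from Intel's doc, and the helper names are 
invented for illustration.

```c
#include <stdint.h>

/* ASSUMED status layout (not confirmed against Intel's documentation):
 *   bits 15-8  interrupt/status bits
 *   bits  7-6  command (transmit) unit state
 *   bits  5-2  receive unit state
 * All helper names here are hypothetical. */
static unsigned scb_irq_bits(uint16_t status) { return status & 0xff00; }
static unsigned scb_cu_state(uint16_t status) { return (status >> 6) & 0x3; }
static unsigned scb_ru_state(uint16_t status) { return (status >> 2) & 0xf; }

/* Bits that differ between two status words. */
static uint16_t scb_diff(uint16_t a, uint16_t b) { return a ^ b; }
```

With these, scb_diff(0x5048, 0x5148) is 0x0100, so the two logged values 
differ only in that one bit, and both give the same receive unit field 
(2) and command unit field (1).  Applied to the low 16 bits of the first 
register word in the post-lockup dump (0x0150), the same helpers give a 
command unit field of 1 and a receive unit field of 4, which is at least 
consistent with eepro100-diag reporting 'Suspended' and 'Ready' there if 
the assumed layout is right.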

speedo_intr_error() is obviously getting called and taking the branch 
that should restart the receiver, but it's not working.  It's not that 
we need to reliably receive data at this ridiculously high rate, but in 
an unattended application, the possibility of the NIC going into zombie 
mode due to some traffic spike is potentially disastrous.  

I'd be happy to completely reset the NIC when the problem occurs, but 
I don't understand the driver well enough to know what to do or why 
speedo_intr_error()'s restart isn't doing that.  

It also occurs to me that maybe the driver is running out of RX buffers.  
Cranking up the RX_RING_SIZE would be fine if doing so would actually 
fix the problem rather than just delaying it enough to not be seen in 
testing. 
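
A rough back-of-the-envelope sketch of why a bigger ring might only 
delay things.  It assumes 1 MB = 10^6 bytes, exactly 1500 bytes per 
packet (headers ignored), and that no RX buffer gets refilled while the 
burst lasts.  The ring size of 32 used in the note below is purely 
illustrative and is not claimed to be the driver's actual RX_RING_SIZE.

```c
#include <assert.h>

/* Packets per second for a given byte rate and packet size. */
static double packets_per_sec(double bytes_per_sec, double bytes_per_pkt)
{
    assert(bytes_per_pkt > 0.0);
    return bytes_per_sec / bytes_per_pkt;
}

/* Milliseconds a ring of ring_size buffers lasts if nothing is refilled. */
static double ring_drain_ms(unsigned ring_size, double pkts_per_sec)
{
    assert(pkts_per_sec > 0.0);
    return 1000.0 * ring_size / pkts_per_sec;
}
```

At 10 MB/sec this works out to roughly 6667 packets/sec, so a 
hypothetical 32-entry ring would be used up in about 4.8 ms and a 
64-entry ring in about 9.6 ms.  At 4 MB/sec (about 2667 packets/sec) the 
same 32 entries last about 12 ms.  Either way the extra headroom is a 
matter of milliseconds.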

Any help or insight would be appreciated.  TIA,

  -les

--------------------------------------------------------------------
Post-lockup eepro100-diag output:

# eepro100-diag -f -v -ee -p 0xec80
eepro100-diag.c:v2.05 6/13/2001 Donald Becker (becker@scyld.com)
 http://www.scyld.com/diag/index.html
Assuming a Intel i82557/8/9 EtherExpressPro100 adapter at 0xec80.
i82557 chip registers at 0xec80:
  00000150 00000000 00000000 00080002 18250081 00000000
  Interrupt sources are pending.
   The transmit unit state is 'Suspended'.
   The receive unit state is 'Ready'.
  This status is unusual for an activated interface.
EEPROM contents, size 64x16:
    00: 0600 195b 6450 0503 0000 0201 4701 0000
  0x08: 02d4 8400 4880 00da 1028 0000 0000 0000
      ...
  0x30: 002c 0000 0000 0000 0000 0000 0000 0000
  0x38: 0000 0000 0000 0000 0000 0000 0000 0888
 The EEPROM checksum is correct.
Intel EtherExpress Pro 10/100 EEPROM contents:
  Station address 00:06:5B:19:50:64.
  Board assembly 02d484-000, Physical connectors present: RJ45
  Primary interface chip i82555 PHY #1.
 MII PHY #1 transceiver registers:
  1000 782d 02a8 0154 05e1 0081 0000 0000
  0000 0000 0000 0000 0000 0000 0000 0000
  0a02 0000 0001 0000 0000 0000 0000 0000
  0000 0000 0b20 0000 0000 0000 0000 0000.#

Prior to lockup, the registers were:
  00000050 1dfad8e4 00000000 00080002 18250081 00000600
Everything else -- EEPROM and MII -- remains the same.
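
For anyone who wants to compare the two register dumps mechanically, a 
minimal sketch follows.  The register values are copied from the dumps 
in this message; the function and array names are invented for 
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register words copied from the two dumps in this message. */
static const uint32_t regs_before[6] = {
    0x00000050, 0x1dfad8e4, 0x00000000, 0x00080002, 0x18250081, 0x00000600
};
static const uint32_t regs_after[6] = {
    0x00000150, 0x00000000, 0x00000000, 0x00080002, 0x18250081, 0x00000000
};

/* Returns a bitmask with bit i set when word i differs between the dumps.
 * The name is hypothetical. */
static unsigned changed_words(const uint32_t *before, const uint32_t *after,
                              size_t n)
{
    unsigned mask = 0;
    size_t i;

    assert(n <= 8 * sizeof mask);  /* one mask bit per register word */
    for (i = 0; i < n; i++)
        if (before[i] != after[i])
            mask |= 1u << i;
    return mask;
}
```

changed_words(regs_before, regs_after, 6) gives 0x23, meaning words 0, 1 
and 5 changed.  Word 0 differs only in the 0x0100 bit, the same bit that 
differs between the two logged status values.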