[3c509] Overruns in receive buffer

Madhav Diwan mdiwan@wagweb.com
Mon Dec 17 15:30:01 2001


Dear Friends,


I have a situation at a client site where I am running a firewall:

The computer is a PIII 600 MHz SBC with 256 MB of SDRAM and an ATA33
drive. The kernel is 2.0.33 and I am running ipfwadm. (No, I can't
upgrade :( The firewall program is custom made to use only this.)

The local LAN is eth0, which runs over a 3Com 905B-TXNM card (revision
E, I believe). I have this card in two units, one primary and one
backup. Both units are hooked up to a Cisco switch on different switch
ports.

Here is the problem:

Occasionally the local LAN NIC of the primary firewall unit gets a
flood of overruns on the receive buffer and stops receiving packets.

After I dial in to the modem, a ping to the interface from the local
loopback confirms that the card is up, and that overruns have jumped
from 0 to 1000 or more within a two-minute interval.
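
A minimal sketch of how a jump like the one described above could be
detected automatically, rather than by dialing in and looking. It is not
the script used on these units. It assumes the "fifo" receive column in
/proc/net/dev is the counter that ifconfig reports as overruns, which
should be checked on a 2.0 kernel, and it finds that column from the
header line instead of hard-coding a position, since the column layout
differs between kernel versions. The sample text and all function names
are made up for illustration.

```python
# Made-up sample in the general shape of /proc/net/dev (illustrative only).
HEADER = (
    "Inter-|   Receive                  |  Transmit\n"
    " face |packets errs drop fifo frame|packets errs drop fifo colls carrier\n"
)
SAMPLE_BEFORE = HEADER + (
    "    lo:     10    0    0    0    0     10    0    0    0     0       0\n"
    "  eth0:   5000    0    0    0    0   4000    0    0    0     0       0\n"
)
SAMPLE_AFTER = HEADER + (
    "    lo:     12    0    0    0    0     12    0    0    0     0       0\n"
    "  eth0:   5000    0    0 1200    0   4100    0    0    0     0       0\n"
)


def parse_fifo_counters(proc_net_dev_text):
    """Return {interface: receive fifo count} from /proc/net/dev-style text."""
    lines = proc_net_dev_text.splitlines()
    # The second header line names the columns; the receive columns sit
    # between the first and second '|' characters.
    rx_columns = lines[1].split("|")[1].split()
    fifo_index = rx_columns.index("fifo")
    counters = {}
    for line in lines[2:]:
        if ":" not in line:
            continue
        name, stats = line.split(":", 1)
        counters[name.strip()] = int(stats.split()[fifo_index])
    return counters


def overrun_jump(previous, current, interface, threshold=1000):
    """True if the fifo counter rose by at least `threshold` between samples."""
    return current[interface] - previous[interface] >= threshold
```

On a live system the two samples would come from reading /proc/net/dev
twice, about two minutes apart, and a True result from
overrun_jump(before, after, "eth0") would be the moment to log or start a
capture.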

Pinging locally from lo to eth0 works.

tcpdump -ln -i eth0 shows no data arriving at the interface, but data
from inside (from the other interfaces) is still being forwarded and
tries to go out eth0.

The network is configured with a private 10.100 network address and a
23-bit subnet mask.

This happens only on whichever unit is primary and does not seem to
affect the backup unit at all.
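
For reference, the addressing described above (a 10.100 network with a
23-bit mask) can be worked out as follows. The post does not give the
full network address, so 10.100.0.0 is used here purely as a
hypothetical example.

```python
import ipaddress

# Hypothetical network: the post only says "10.100" with a 23-bit mask.
network = ipaddress.ip_network("10.100.0.0/23")

netmask = str(network.netmask)              # dotted-quad form of the mask
usable_hosts = network.num_addresses - 2    # minus network and broadcast
broadcast = str(network.broadcast_address)  # where broadcast traffic goes
```

A 23-bit mask puts up to 510 hosts in one broadcast domain, so every
broadcast (NetBIOS name traffic, for example) reaches the firewall's
eth0.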
 
1.1) I changed MAC addresses.

1.2) I have swapped the card out and changed the revision level from D
to E.

2) I have changed driver versions from the original Caldera OpenLinux
1.2 driver to Becker's "3c59x.c:v0.99E 5/12/98". This seemed to help
delay the event, until I changed the buffer size.

3) I was sharing some IRQs, so I moved the cards around so that each
has a separate IRQ from the mainboard. Still the same: no overruns, and
then lots of OVERRUNS!!

4.1) I have changed the receive ring buffer size in the driver,
doubling it from 32 to 64.

4.2) I have changed the packet buffer size in the driver, again
doubling the size from 1026 to 3072.
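
A rough back-of-the-envelope sketch of what the buffer changes in 4.1
and 4.2 buy. It uses the ring and buffer sizes quoted above. The
constant names are illustrative, and the 100 Mbit/s link speed, the
standard Ethernet framing overhead, and the idea that each received
frame takes one ring entry while the host drains nothing are all
assumptions made for the arithmetic.

```python
RX_RING_SIZE = 64                    # ring entries after change 4.1
PKT_BUF_SZ = 3072                    # bytes per buffer after change 4.2
LINK_BITS_PER_SEC = 100 * 10**6      # assumption: 100 Mbit/s link
WIRE_OVERHEAD = 20                   # preamble + SFD (8) and inter-frame gap (12), bytes


def ring_bytes(entries, buf_size):
    """Total memory taken by the receive ring's packet buffers."""
    return entries * buf_size


def seconds_to_fill(entries, frame_bytes, link_bps=LINK_BITS_PER_SEC):
    """Time for back-to-back frames to use every ring entry if none are drained."""
    bits_per_frame = (frame_bytes + WIRE_OVERHEAD) * 8
    return entries * bits_per_frame / link_bps


ring_memory = ring_bytes(RX_RING_SIZE, PKT_BUF_SZ)     # bytes
fill_min_frames = seconds_to_fill(RX_RING_SIZE, 64)    # 64-byte frames
fill_max_frames = seconds_to_fill(RX_RING_SIZE, 1518)  # 1518-byte frames
```

Under these assumptions the ring holds 192 KiB, but a 64-entry ring
fills in well under 10 ms at line rate whatever the frame size. If the
host stops servicing the ring for longer than that, a bigger ring only
delays the first overrun slightly, which would be consistent with the
changes making no visible difference.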

Nothing seems to affect the event. I'm inclined to think it is an event,
as there is no time schedule: sometimes the primary unit works for a
week or more, sometimes a day, and once it failed just after a reboot.

Putting a hub between the switch and the card seemed to help it last a
bit longer, before my buffer size changes, but I think that was just
coincidence, as it certainly does not help now.

I am trying to get a different card, preferably one with a DEC Tulip
chipset (rare to get these days :( ), and try that scenario.

I also found an article that seemed likely to help, but I have now gone
through most of its suggestions and still have the problem:

http://www.uwsg.iu.edu/hypermail/linux/kernel/0009.2/1199.html


I have a sniffer and packet logger attached to a hub in between to try
to capture the event, but so far no luck, as the capture buffer is only
20 megabytes (Windows-based sniffer) and gets overwritten by the time I
get there. (Of course it's a remote site... even if it's only down the
street.)


My most recent failure was on Sunday morning. The log shows only
NetBIOS traffic being blocked until 8:57 AM (generally I see this every
20 minutes). The next log entry, at 9:29 AM (about 30 minutes later),
is that it can't ping over eth0 and is therefore rebooting. (The backup
reboots it first, since I scripted it to look for that failure on the
primary as well and log what it did.)
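
The ping-and-reboot check described above might look roughly like the
sketch below. This is not the script running on these units, which is
not shown in the post. The function names, the failure threshold, and
the idea of requiring several consecutive failures before acting are
assumptions for illustration, and the ping flags assume a Linux ping.

```python
import subprocess
import time


def ping_ok(host, count=1):
    """Return True if the system ping gets a reply (Linux-style flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def check_once(host, probe, failures, max_failures, log, now=time.time):
    """Probe the host once and return (new_failure_count, should_reboot).

    `probe` is any callable taking a host and returning True on success,
    so the decision logic can be exercised without touching the network.
    """
    if probe(host):
        return 0, False  # a success resets the consecutive-failure count
    failures += 1
    log("%s ping to %s failed (%d/%d)"
        % (time.ctime(now()), host, failures, max_failures))
    return failures, failures >= max_failures
```

A loop on the backup unit could call check_once(host, ping_ok, ...)
every minute or so, save the interface counters when should_reboot
first becomes True, and only then trigger the reboot, so that some
evidence survives each failure.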

Any help or advice at this point is GREATLY needed and appreciated.

Thank you

Sincerely,


Madhav Diwan