interface dies under network load on SMP machines

Mike Simons msimons@saic1.com
Thu Aug 27 14:22:54 1998


>   Here we go with a new problem with SMP 2.0.35+ (+ meaning also
> with Alan's pre36 kernels) and tulip.c (up to v0.89K - SMPCHECK
> compiled in).
> 
>    The interface goes dead from time to time without leaving any log
> messages - a simple ifconfig down/up brings it back to life. To
> stabilise the systems I've written a small program that checks the
> network and restarts the interface if required. So everything is 
> nearly perfectly fine...
> 
> Here are the tulip-diag outputs for the Kingston KNE 10/100 cards:

Wolfgang and all...

    Wonder if this might be a problem outside the tulip driver?


  I've been seeing the same "ifconfig eth* down/up" problem on
non-SMP machines, all at 10bT, mostly with 3c509 cards...
  This problem seems to take time (more than 14 days) to appear... so
it only affects the old (10bT) servers, running newer kernels.
- tcpdumps from affected machines show only local arps going out...
- tcpdumps on other machines show nothing on the wire from these...
A simple ifconfig down/up fixes the problem.
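
  For anyone who wants to automate the workaround in the meantime, here
is a rough sketch of the kind of check-and-restart program Wolfgang
mentions. It is not his program, just an illustration. The function names,
the host address, the 3-failure threshold and the 60 second interval are
all made up for the example, and it assumes ping and ifconfig are on the
PATH and that it runs as root.

```python
import subprocess
import time


def link_alive(host, run=subprocess.call):
    """Return True if a single ping to `host` exits with status 0."""
    return run(["ping", "-c", "1", host],
               stdout=subprocess.DEVNULL,
               stderr=subprocess.DEVNULL) == 0


def bounce(iface, run=subprocess.call):
    """Take the interface down and bring it back up (the manual workaround)."""
    run(["ifconfig", iface, "down"])
    run(["ifconfig", iface, "up"])


def watchdog_step(failures, alive, max_failures=3):
    """Update the consecutive-failure count.

    Returns (new_failure_count, should_bounce). The count resets on any
    successful check and after a bounce has been requested.
    """
    if alive:
        return 0, False
    failures += 1
    if failures >= max_failures:
        return 0, True
    return failures, False


def main(iface="eth0", host="192.168.1.1", interval=60):
    """Loop forever. iface, host and interval are placeholder values."""
    failures = 0
    while True:
        failures, restart = watchdog_step(failures, link_alive(host))
        if restart:
            bounce(iface)
        time.sleep(interval)
```

  Calling main() starts the loop. Requiring a few failed pings in a row
keeps one lost packet from bouncing a healthy interface; the host should
be something on the local segment that is expected to always answer.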

kernels:
  2.0.33 with 3com509
  2.0.34 with 3com509
  2.0.35 with 3com509 and DE500
  2.1.106 with 3com509

  The last few times it has happened I noted this message on the console
(with a 3com card w/ 2.0.33): "eth0: Infinite loop in interrupt, status ffff."
  We have one SMC/tulip machine running 2.0.30... an nfs server with 330 days
of uptime... it has never happened to that server.  A 2.0.33 server that
the problem hit today had gone only 7 days since the last time...
but was up five months straight without a problem under 2.0.30.


  Most of the machines in the building are using SMC Tulip cards... and
they get rebooted a few times a week for a few minutes of Windows (Office),
so I don't know if the same thing would happen to those cards with new
kernels.

--
    Thanks,
      Mike Simons
      Science Applications International Corporation



Don't be too worried about the DE500's on the list above:

  We got the DE500's recently and I'm still not happy with them...
they don't auto-neg 100Tx...
  - the times they have disappeared I
  - they sometimes take time to get working at 10bT on boot :).
  - I _think_ they may have just spontaneously decided to 
    auto-negotiate 100Tx with a 10bT hub when they dropped off.  
... need to tinker more before I'm sure there is any problem.