SMP+Tulip lockup, 2.0.36pre15, help requested

G.W. Wettstein greg@wind.enjellic.com
Thu Nov 5 16:37:58 1998


On Nov 5, 11:03am, Donald Becker wrote:
} Subject: Re: SMP+Tulip lockup, 2.0.36pre15, help requested

> On Thu, 5 Nov 1998, Alan Cox wrote:
> 
> > Subject: Re: SMP+Tulip lockup, 2.0.36pre15, help requested
> > 
> > > This is a bug in the SMP interrupt dispatch code, which the driver is
> > > detecting.
> > > A patch is known -- I'm surprised it's not in pre15.
> > 
> > I'm still busy feeding the patch to anyone with a problem and finding how
> > well it works. So far nobody who has tried it has reported a problem, which
> > is a good sign.
> > 
> > So it's still down for 37pre1.

> Errrm, I'm surprised that it's not a priority for 2.0.36.
> This is a serious SMP bug that is likely to hang most drivers that don't
> explicitly check for the problem.  I got a lot of mail on the subject.
> 
> [[ I have a personal stake in this -- everyone assumes that since my drivers
> detect this, that it's my bug.  ]]

In the FWIW department I think that this needs to be given serious
consideration for 2.0.36 as well.  I have one production SMP box that
is running UP because of possible SMP stability problems.  Linux is
beginning to hit the enterprise marketplace very rapidly and stable
SMP is critical IMHO.

On a related note, Don, as long as I possibly have your attention, I
think that we tracked down the EPIC problems as well.  I think that
there is possibly a race condition in the driver with respect to
handling packets that get stomped on.

I spoke with you on the phone about this and I reported the problems a
couple of times to the list.  One of the other machines on the same
shared network segment had an Etherpower II that didn't switch from
full-duplex when we transferred the machine from a switch to a shared
hub.  This may have been secondary to the v0.99B driver which was
in the stock 2.0.35 kernel.

The only two machines on the shared segment were the SMP box and this
uni-processor machine.  Even with the SMP machine running a UP kernel
we were seeing the NIC hang, which was remedied by downing and upping
the interface.  Interestingly, the /proc/net/dev statistics were
reporting exactly 2 transmit carrier errors for each transmit errs
(general error) count.  Here is a snippet of a watch log that shows the
errors being reported at 15-minute intervals:


10/29/98 09:02:29
Inter-|   Receive                  |  Transmit
 face |packets errs drop fifo frame|packets errs drop fifo colls carrier
    lo:    444    0    0    0    0      444    0    0    0     0    0
  eth0:1988037 4359 4359    0 3083  1585647 2580    0    0 52091 5160
  eth1:  74857    0    0    0    0   671148    0    0    1     0    0

10/29/98 09:17:29
Inter-|   Receive                  |  Transmit
 face |packets errs drop fifo frame|packets errs drop fifo colls carrier
    lo:    444    0    0    0    0      444    0    0    0     0    0
  eth0:2010084 4359 4359    0 3120  1604509 2617    0    0 52522 5234
  eth1:  74887    0    0    0    0   671148    0    0    1     0    0

10/29/98 09:32:30
Inter-|   Receive                  |  Transmit
 face |packets errs drop fifo frame|packets errs drop fifo colls carrier
    lo:    452    0    0    0    0      452    0    0    0     0    0
  eth0:2030163 4359 4359    0 3154  1619835 2640    0    0 52755 5280
  eth1:  74919    0    0    0    0   671148    0    0    1     0    0


We ended up tracking this down by putting a sniffer on the shared
segment and in doing so we noticed that the SMP box was getting its
frames stomped on very late.  This tipped us off to the fact that the
other machine wasn't paying much attention to who else was talking.

I don't know the significance of two carrier errs getting reported for
each error in the errs field, but I suspect it may give us some hint as
to how the driver is getting fouled up.  An SMP kernel makes the
situation even worse.
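
For anyone who wants to check the two-to-one relationship against their
own counters, here is a rough sketch in Python that works on the eth0
lines from the watch log above.  The field names are just labels I
picked to match the column headers in that log, and the helper names
(parse_line, carrier_per_tx_err, deltas) are hypothetical.  It assumes
the 2.0-style layout shown above, with 5 receive and 6 transmit
columns; other kernels may lay the file out differently.

```python
# Column labels assumed from the header lines in the log above:
# 5 receive columns followed by 6 transmit columns.
FIELDS = [
    "rx_packets", "rx_errs", "rx_drop", "rx_fifo", "rx_frame",
    "tx_packets", "tx_errs", "tx_drop", "tx_fifo", "tx_colls", "tx_carrier",
]

# eth0 lines copied from the three samples in the log above.
SAMPLES = [
    "  eth0:1988037 4359 4359    0 3083  1585647 2580    0    0 52091 5160",
    "  eth0:2010084 4359 4359    0 3120  1604509 2617    0    0 52522 5234",
    "  eth0:2030163 4359 4359    0 3154  1619835 2640    0    0 52755 5280",
]


def parse_line(line):
    """Split one 'iface: n n n ...' line into (name, {field: int})."""
    name, _, rest = line.partition(":")
    values = [int(v) for v in rest.split()]
    if len(values) != len(FIELDS):
        raise ValueError("unexpected column count: %d" % len(values))
    return name.strip(), dict(zip(FIELDS, values))


def carrier_per_tx_err(samples):
    """Ratio of transmit carrier errors to transmit errs for each sample."""
    ratios = []
    for line in samples:
        _, stats = parse_line(line)
        ratios.append(stats["tx_carrier"] / stats["tx_errs"])
    return ratios


def deltas(samples, field):
    """Change in one counter between consecutive samples."""
    values = [parse_line(line)[1][field] for line in samples]
    return [later - earlier for earlier, later in zip(values, values[1:])]


if __name__ == "__main__":
    print("carrier per tx err:", carrier_per_tx_err(SAMPLES))
    print("tx_errs deltas:    ", deltas(SAMPLES, "tx_errs"))
    print("tx_carrier deltas: ", deltas(SAMPLES, "tx_carrier"))
```

The per-interval deltas are the more useful check, since they show the
relationship holding for newly counted errors and not just for the
running totals.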

The machine in question has been absolutely trouble-free since we
reset the card in the other machine that was stuck in full-duplex
mode.  We do see occasional errs being reported but now there is an
exact one-to-one relationship between the errs count and the carrier
count.

For interested onlookers, the message is that if you are going to use
an SMC Etherpower II card in a high-traffic or production machine, be
sure that your kernel sources are upgraded to the v1.04 driver.  While
it shouldn't happen in a properly functioning network, it should also
be noted that the EPIC driver probably has trouble coping with
situations where it encounters a large number of transmit errors.

Hopefully others can gain from this knowledge.  It was reasonably
frustrating tracking this problem down.

Donald, if there is anything I can do to help debug the problem further,
let me know.

> Donald Becker					  becker@cesdis.gsfc.nasa.gov

A nice end of the week to everyone.

Greg

}-- End of excerpt from Donald Becker

As always,
Dr. G.W. Wettstein           Enjellic Systems Development - Specialists in
4206 N. 19th Ave.	     intranet based enterprise information solutions.
Fargo, ND  58102             WWW: http://www.enjellic.com
Phone: 701-281-1686	     EMAIL: greg@wind.enjellic.com
------------------------------------------------------------------------------
"I am returning this otherwise good typing paper to you because
someone has printed gibberish all over it and put your name at the
top."
				-- English Professor, Ohio University