"Transmit timed out" with EtherExpress Pro100B

Robert G. Brown rgb@phy.duke.edu
Tue Oct 6 12:38:06 1998


On Tue, 6 Oct 1998, James Stevens wrote:

> I did, I was getting loads of counts in the TX-FIFO, so I tried upping
> (to 15) and downing (to 2) both the txfifo & rxfifo, and I tried
> removing the transmitter restart; none of this made any difference. The
> chip still hung. I've had this bug, completely reproducible between two
> Linux boxes on 10Mbps Ethernet, since 1st Sep, and I'm just getting
> frustrated with it (the bug, that is).
> 
> If you have any ideas I'd really appreciate them. The error I get is
> "status 0050 command 0000", generated by the function
> "speedo_tx_timeout". I believe the transmitter _has_ hung, as the system
> locks up totally when I remove the transmitter restart, and I suspect it
> is down to a bug in the chip I outlined in my previous e-mail.

A few questions (I don't remember your original posting):

  a) Are your systems SMP or UP?  Fast (e.g. PPro or PII)?
  b) Is your 10Mbps interconnect crossover wire or hub or switch?

The reason for asking both is that the eepro100 has had known reentrancy
problems on SMP systems.  They don't appear to affect performance or
cause lockups in most cases, and I think the threshold for reporting them
at all has been raised, but it occurs to me that looping reentrancy on a
fast SMP box with a relatively slow channel might exhaust some resource
and/or prevent the disentangling of all the interrupts.

A SECOND reason for asking is that if you compiled your kernel with SMP
and are using a hand-built non-SMP eepro100.o module, the
test_and_set_bit and clear_bit commands used to manipulate the
dev->tbusy flag will almost certainly fail by reordering and it would
not surprise me at all to find both reentrancy and transmitter hang, if
not a full IRQ deadlock.  If you built the device into your kernel then
of course it has __SMP__ correctly defined and this cannot be the
cause.
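To make the tbusy handshake concrete, here is a user-space sketch (my own
illustration using GCC's __atomic builtins, NOT the kernel's actual
implementation) of the semantics the driver depends on: test_and_set_bit
must atomically set the bit and return its OLD value, so that exactly one
code path "claims" the transmitter at a time.

```c
/* Stand-in for dev->tbusy; in the kernel this lives in struct device. */
static unsigned long tbusy;

/* Model of the kernel's test_and_set_bit: atomically set bit `nr` in
 * *addr and return its OLD value.  A UP-built eepro100.o on an SMP
 * kernel loses the bus-locked atomicity this guarantee requires. */
static int test_and_set_bit(int nr, unsigned long *addr)
{
    unsigned long mask = 1UL << nr;
    return (__atomic_fetch_or(addr, mask, __ATOMIC_SEQ_CST) & mask) != 0;
}

/* Model of the kernel's clear_bit: atomically clear bit `nr` in *addr. */
static void clear_bit(int nr, unsigned long *addr)
{
    __atomic_fetch_and(addr, ~(1UL << nr), __ATOMIC_SEQ_CST);
}
```

With real atomic ops the first caller of test_and_set_bit(0, &tbusy) gets
back 0 (transmitter claimed) and any reentrant caller gets 1 (already
busy).  Without the atomicity, two CPUs can both observe 0 and both start
a transmit -- exactly the kind of "impossible" state that ends in a hang.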

If you are connecting them with a wire (or CAN connect them with a wire
or 100B hub/switch) you ought to be able to try out 100BT and see if the
problem persists.  I certainly don't see it here on a clean 100BT
switched channel, which doesn't mean a thing, of course.

A third thought is that I've experienced significant problems with
"good" cards depending on the state of the hub or switch.  As an
example, I have a number of systems with Tulip cards in them, and have
measured very good performance between pairs of tulips or eepro100's
through our Cisco Cat5K switch.  However, when we put our switch in
fixed 100BT/full duplex mode on certain ports, the cards failed to
correctly set themselves automatically.  They appeared to run 100BT/FD,
but all I could get out of them was around 1-2 Mbps with tons of errors
and resets, not unlike what you report.  When I reset the switch so port
negotiation was back to "auto", the cards worked perfectly again.  Quite
possibly a bug in the card driver, but I don't know enough about the
MII/NWay process to know if the bug was really in the hub.  Either
way, hub tuning and trying different card "options=xx" settings might
help.  I've found that it is almost always better to let the card
negotiate with a hub set to negotiate.

The curious thing is that your card works for a while and then fails,
consistently, under any kind of heavy network load.  (Right? About the
"heavy load"?) If it were an out and out failure of a command, via a
real, hard, bug, I'd sort of expect the driver to fail for everybody
(and very quickly and not necessarily in a load dependent way) -- so I
have to ask the list -- is ANYBODY successful with eepro100's on 10BT?
Especially under load?

If folks are successful with eepro100's and 10BT hubs, switches, and
crossover wires in SOME systems, it strongly suggests that the problem
is environmental -- something about your personal configuration or
networking operation that is tweaking the bug.  I assume that you've
done things like swapped hub/switch ports, replaced the cables, and
generally ensured that the problem isn't just bad media or hubs.  Have
you ruled out bad silicon?  Are you overclocking?  Could the net adapter
be failing because of excessive heat in the cases?  Sorry if these are
stupid questions, but sometimes it is little things that cause big
failures.

> I guess what I really need to do is throttle the transmit traffic to the
> card or something?
> 
> Can I reduce the TX FIFO to 0 to have packets sent immediately, or not
> at all, instead of being queued?  I know it would hit performance, but
> it would be interesting to try.

The only thing that I can think of that you might usefully try after
considering the stuff above is to put in some printk's around the points
where the tbusy flag is being set or cleared.  This might reveal just
where in the queue things are screwing up -- I would guess that
something "impossible" is happening.  That's good as it should stand out
like a sore thumb.  I don't think any simple tuning will help...
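For concreteness, the sort of instrumentation I mean looks roughly like
this (a sketch against the eepro100.c sources as I remember them -- the
function and field names, speedo_start_xmit, sp->cur_tx, sp->dirty_tx and
so on, may differ in your version, so adjust to your actual driver):

```c
/* In speedo_start_xmit(), around the existing tbusy manipulation: */
if (test_and_set_bit(0, (void *)&dev->tbusy) != 0) {
        /* "Impossible": called with the transmitter already marked busy. */
        printk(KERN_WARNING "%s: start_xmit with tbusy already set!\n",
               dev->name);
} else {
        printk(KERN_DEBUG "%s: tbusy 0->1, cur_tx %d dirty_tx %d.\n",
               dev->name, sp->cur_tx, sp->dirty_tx);
}

/* And in the interrupt/tx-cleanup path, where tbusy is released: */
clear_bit(0, (void *)&dev->tbusy);
printk(KERN_DEBUG "%s: tbusy cleared, dirty_tx now %d.\n",
       dev->name, sp->dirty_tx);
```

If the 0->1 warning ever fires, or if you see two "tbusy 0->1" lines with
no "cleared" line between them, you have caught the impossible state in
the act and know which side of the handshake to stare at.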

Hope some of this helps,

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb@phy.duke.edu