[vortex-bug] tbusy and tx_full

Donald Becker becker@scyld.com
Thu, 18 May 2000 13:29:12 -0400 (EDT)


On Thu, 18 May 2000, Bogdan Costescu wrote:

> On Thu, 18 May 2000, Donald Becker wrote:
> > The dev->tbusy flag isn't just for the driver to lock its transmit routine.
> > It is used to tell the software queue layer that the device cannot accept
> > another packet to transmit.  The driver should leave dev->tbusy set when it
> > cannot accept more packets, and clear it in the interrupt routine as Tx
> > space becomes available.
> 
> Errr... then why do we check for tbusy at the beginning of start_xmit? If

Because the queue layer doesn't do a perfect check.
Having the dev->tbusy lock later allows better queue-level performance
because it only needed to do an approximate check of dev->tbusy, rather than
locking a longer code path.

The Tx-timeout code, based on calling the Tx routine with dev->tbusy set,
was originally a recovery hack based on the *bug* of potentially re-entering
the Tx routine on a timer-based retransmission.  (It worked out very well,
because you were usually only retransmitting when a packet didn't get
through!)

> > Triggering the BH to run is somewhat expensive.  With hysteresis you allow
> > the queue layer to do a chunk of work at once, rather than constantly
> > cycling from the interrupt handler to the queue layer refill.
> 
> Yes, and you increase the latency. You know, the Beowulf guys are
> sensitive to this...

No, it doesn't increase the latency.  The queue still has 6/8/12 packets
waiting to be transmitted when the refill is triggered.

> And what about the M-VIA and S-Core guys who are using the kernel drivers
> as starting point for their low latency comunnication software; do they
> know about this "feature" ?

Hmmm, I don't really care about the people violating the GPL.  If you start
with GPL software, you must release the entire work under the GPL.  You
can't say "here are the mods to the GPL code so that my new non-GPL module
will link with it".  That's a derived work.  If you don't agree with the GPL,
don't base your code on GPLed works.

[[ The hysteresis doesn't impact the performance anyway. ]]

> > TCP/IP traffic, the driver sees diminishing returns once the driver queue
> > length reaches 5-7 entries.  More than 10 entries almost never increases
> > performance.
> 
> Is this with or without interrupt mitigation?

Good question with this driver.
Interrupt mitigation has little impact with TCP/IP traffic, if the driver
always checks the Tx queue in the interrupt handler.  The goal is that the
Rx traffic, which must be handled on a timely basis, will cause all
interrupts and the Tx queue will be cleaned up as needed.

You might think that the Tx routine itself would be a good place to clean up
the Tx queue.  The 3c509 driver is an example of a driver that gathers the
status of the transmitted packets at the end of the transmit routine.  But
the 3c509 driver has copied the packet to the card, while bus-master drivers
are still locking the skbuffs.  The protocol layer thinks "oh, that packet
hasn't been lost.  It's still in the Tx queue" and you sometimes don't get
a retransmit when you should.

> > Yes, the current 3c59x interrupt handler has the very bad behavior of
> > doing the following while cleaning up the Tx queue
> > 	if (inl(ioaddr + DownListPtr) == virt_to_bus(&vp->tx_ring[entry]))
> > This is an expensive PCI read for every packet free.  It mostly made sense
> > with the 3c905, but it's not needed at all with the 3c905B/C.
> 
> That's why I proposed 2 methods for avoiding this:
> 
> 1. use PktID in DPD FSH = index of DPD in tx_ring. This will limit the
> size of the ring to 256 entries (which is too much anyway IMHO). You can
> read only one register which will give you the PktID of the current DPD
> being processed. Then you free all DPDs between dirty_tx and this PktID.
> Only 1 PCI read.

For Linux that is no more informative than the  DownListPtr  value that we
currently use.

> 2. use dnComplete bit in DPD FSH. This is set when the NIC finishes
> processing that DPD (Warning: B and C rev. have different download
> sequences!). You keep the existing loop, but you make 
> 	tx_ring[entry].status & dnComplete
> your condition in "if". This does not impose the tx_ring size limit. No

This looks exactly like the other bus-master drivers, and is what I have in
my test version (.

The Tulip had this part right from the beginning.  Well, they actually
cribbed some of the design elements from AMD.  Every designer that studied
the Tulip got it mostly right.

Remember that some of this driver code was written, and the hardware
designed, when a PCI I/O read might be faster than a uncached memory read.
Fast Ethernet was *fast*, with a packet reception time comparable to a
context switches.

Donald Becker				becker@scyld.com
Scyld Computing Corporation
410 Severn Ave. Suite 210
Annapolis MD 21403