[vortex] Re: [vortex-bug] DPD FSH for Cyclone/Tornado

Donald Becker becker@scyld.com
Tue, 16 May 2000 15:30:33 -0400 (EDT)


[[ Note: This discussion moved to linux-vortex from linux-vortex-bug. ]]

On Tue, 16 May 2000, Bogdan Costescu wrote:

> > 		vp->tx_ring[entry].status = cpu_to_le32(skb->len);
> 
> You got this wrong:---------------------------------------^^^^^^
> I suggested:
> 		vp->tx_ring[entry].status = cpu_to_le32(TxIntrUploaded);

Doh!  I did M-C-k and didn't read the result.  Not enough sleep.

> > They look similar, but the calculation of
> >       if (vp->drv_flags & IS_BOOMERANG)
> > adds instructions to shuffle values on the register-poor x86.
> 
> I don't quite understand this: the test is done in both cases, you only
> make the code on each branch a little longer. This does affect the cache
> impact.

The subtle difference is that on some processors, with a good compiler,
   foo[entry].ele0 = 12;
   foo[entry].ele1 = 23;
   foo[entry].ele2 = 45;
   foo[entry].ele3 = 56;
fills a single write buffer line, which is then written to memory in a
single transaction.

Doing
   foo[entry].ele0 = 12;
   foo[entry].ele1 = 23;
   foo[entry].ele2 = 45;
   if (read_uncached_something & 0x8888)
        foo[entry].ele3 = 56;
might trigger a read-modify-write of an incomplete memory or cache line,
followed by another R-M-W when the conditional store finally issues.
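
A concrete sketch of the store ordering I mean, using illustrative field
names (and sidestepping the question of which bits actually belong in the
Cyclone/Tornado FSH, which is what this thread is about):

   /* Sketch of store ordering only -- treat the values as placeholders.
    * The point is that the uncached test happens before the stores, so
    * the four descriptor writes stay consecutive and a write-combining
    * CPU can flush them as one burst. */
   u32 fsh = cpu_to_le32(skb->len);
   if (vp->drv_flags & IS_BOOMERANG)
           fsh = cpu_to_le32(skb->len | TxIntrUploaded);

   vp->tx_ring[entry].next   = 0;
   vp->tx_ring[entry].addr   = cpu_to_le32(virt_to_bus(skb->data));
   vp->tx_ring[entry].length = cpu_to_le32(skb->len | LAST_FRAG);
   vp->tx_ring[entry].status = fsh;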

> Agreed. Another idea would be to have separate routines for Boomerang and
> Cyclone/Tornado where no "if" is required. The driver grows a bit, but the
> cache impact is reduced.

Yes, this is critical-path code, and it's relatively small, so duplicating it
costs almost nothing.
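
One way to get there is to pick the Tx routine once at probe time, so the
per-packet path never tests drv_flags at all.  A sketch, with illustrative
routine names:

   /* Probe-time selection instead of a per-packet branch. */
   if (vp->drv_flags & IS_BOOMERANG)
           dev->hard_start_xmit = boomerang_start_xmit;
   else
           dev->hard_start_xmit = cyclone_start_xmit;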

> I assume that you mean the writing into the data cache and not the caching
> of the code that writes. But there is still something that I don't
> understand: why are you so picky now about this when there are other
> sub-optimal things in the driver? Is a cache fault so much worse than I/O
> operations that we need anyway when dealing with H/W devices?

No.  PCI I/O operations are easily the biggest impact.  The PCI bus will
increasingly look like Ethernet, where you painfully wait while a long burst
passes.  Even conflict-free reads cost the time of hundreds of instructions.

> I'm thinking here of the tbusy flag which has no special use AFAICS.

On many kernel versions it does.
The Tx-timeout code, which works by calling the Tx routine with dev->tbusy
set, was originally a recovery hack that relied on the *bug* of potentially
re-entering the Tx routine on a timer-based retransmission.  (It worked out
very well, because you were usually only retransmitting when a packet didn't
get through!)

Having the dev->tbusy lock later allowed better queue-level performance,
because the queue layer only needed an approximate check of dev->tbusy rather
than taking a lock around a longer code path itself.
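
For reference, the classic 2.2-era shape of that path looks roughly like this
(a sketch with made-up names, not the vortex code): tbusy doubles as a cheap
entry lock and as the hook that notices a wedged transmitter.

   static int foo_start_xmit(struct sk_buff *skb, struct device *dev)
   {
           if (test_and_set_bit(0, (void *)&dev->tbusy) != 0) {
                   /* Re-entered with tbusy set.  If it has been set too
                    * long, assume the Tx engine wedged and recover. */
                   if (jiffies - dev->trans_start < TX_TIMEOUT)
                           return 1;       /* still busy: requeue the skb */
                   foo_tx_reset(dev);      /* hypothetical recovery path */
           }
           /* ... fill a descriptor, kick the NIC, and clear tbusy if the
            * ring still has room ... */
           dev->trans_start = jiffies;
           return 0;
   }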

> ...
> sync). If this flag is eliminated, the code becomes shorter in several
> places.

Yes, *every* driver is simplified.
More importantly, it moves all queue and SMP locking into the queue layer.
Ideally, a driver for a clean device design won't need to know about SMP at
all, certainly not in the critical path.
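
That is the direction the netif_* interface in the 2.3/2.4 tree takes: the
driver only tells the queue layer when its ring fills and when a slot frees
up, and the queue layer owns the locking and the Tx timeout.  Roughly:

   /* In start_xmit, after filling a descriptor: */
   if (vp->cur_tx - vp->dirty_tx >= TX_RING_SIZE - 1)
           netif_stop_queue(dev);          /* ring full: stop feeding us */

   /* In the Tx-completion handler, once a slot is free again: */
   netif_wake_queue(dev);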

Donald Becker				becker@scyld.com
Scyld Computing Corporation
410 Severn Ave. Suite 210
Annapolis MD 21403