[vortex] Re: DnStall period

Bogdan Costescu Bogdan.Costescu@IWR.Uni-Heidelberg.De
Wed, 17 May 2000 12:41:50 +0200 (CEST)


On Sat, 13 May 2000, Andrew Morton wrote:

> Bogdan,  I am at a loss to explain why increasing the loop count from
> 2,000 to 4,000 changed anything for you.  You're on switched 100bT,
> right?  You shouldn't be getting _any_ collisions (and when in full
> duplex mode the NIC doesn't even look for collisions).  So what's going
> on?
>
> I suspect that you're mistaken and that upping the loop counter was not
> the source of your success.
>
> Can you please drop my debug wait_for_completion() into your driver and
> let us know what happens?  Thanks.

Sorry for the late reply to this... Before rebooting for the no-tx_full
version of the driver, I had the 2.2.16-pre2 driver with
wait_for_completion added running for about 24h (the crash used to happen
in less time than that).
The result is that I get one "big" value (166 or close to it) at driver
init (from vortex_open), while all the others (which come only from
boomerang_start_xmit) are less than 20 (the biggest is in fact 17). For
the reported values (i.e. > 10), I made a small histogram, which showed
an almost equal probability for all values, with a small increase in the
12-13 region. This is true for all 4 computers that I had running with
this kernel (Gigabyte 6BXD, 2x PIII-450, 3C905C, BayStack 350-24T
switch). All other reported values are normal (I also included the
DmaCtrl register, page 95 in the C docs).
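
For reference, here is a minimal sketch of what such an instrumented
wait loop could look like. EL3_CMD, EL3_STATUS and CmdInProgress are the
register offset and status bit names from the stock 3c59x driver; the
4000-iteration cap, the ">10" reporting threshold and the printk format
are my assumptions, not necessarily the exact debug patch:

#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <asm/io.h>

/* Issue a command and spin until CmdInProgress clears, counting the
 * iterations; report anything above the noise threshold. */
static void wait_for_completion(struct device *dev, int cmd)
{
	int i;

	outw(cmd, dev->base_addr + EL3_CMD);
	for (i = 0; i < 4000; i++)
		if (!(inw(dev->base_addr + EL3_STATUS) & CmdInProgress))
			break;
	if (i > 10)	/* only the "big" values are interesting */
		printk(KERN_NOTICE "%s: cmd 0x%04x took %d loops.\n",
		       dev->name, cmd, i);
}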
I think I failed to mention in my previous posts on this issue that I
also tried completely disabling the timer routine (I knew the driver
would pick up the autonegotiated value correctly, and vortex_timer is
not needed to re-enable interrupts as required in the "Too much
work..." case) while keeping the original driver (i.e. without
wait_for_completion). It still crashed, but this eliminated vortex_timer
as a possible crash "producer".
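
For completeness, the change amounts to setting the timer up as usual in
vortex_open() but never arming it (field names follow the stock driver;
the #if 0 is just the simplest way to do the test):

	init_timer(&vp->timer);
	vp->timer.expires = RUN_AT(media_tbl[dev->if_port].wait);
	vp->timer.data = (unsigned long)dev;
	vp->timer.function = &vortex_timer;
#if 0	/* disabled for this test: the link is autonegotiated correctly
	   and the "Too much work" re-enable path is not exercised */
	add_timer(&vp->timer);
#endif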

So, clearly, the increased loop counter is not really useful in my case.
But as I said before, adding this function was the only change I made
between a crashing and a non-crashing kernel (I assume that running
under continuous load for more than 7 days without a glitch on all 4
computers qualifies as non-crashing). As I suggested before, this might
be related to the fact that the fix adds a jmp/ret (a function call) to
the code, which changes the code flow (and its cache impact) - but I
cannot offer a good explanation for this.
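
To illustrate what I mean: before the patch, each command site spun
inline on the status register, roughly like this (TxReset is just one
example site):

	/* old: the busy-wait is expanded inline at every command site */
	outw(TxReset, ioaddr + EL3_CMD);
	for (i = 2000; i >= 0; i--)
		if (!(inw(ioaddr + EL3_STATUS) & CmdInProgress))
			break;

	/* new: the same wait sits behind a call/ret, which moves the loop
	   out of line and changes the instruction stream at each site */
	wait_for_completion(dev, TxReset);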

Sincerely,

Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu@IWR.Uni-Heidelberg.De