[tulip-bug] Possible deadlock with tulip driver

Sat, 16 Sep 2000 21:27:07 +0100

I seem to have encountered a possible deadlock using the latest tulip
driver which shows up in both 2.2.16 and 2.2.17. Long report attached.

ael
-- 
Dr A E Lawrence (from home)

Possible deadlock using tulip driver version v0.92 4/17/2000.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

History: 2.2.16
===============

I recently installed two Netgear FA310TX NICs. I fetched and compiled the latest
version of the tulip driver as above: 0.92. I made a few tests, and all was 
well. The cards always correctly negotiated 100Mb full duplex mode, and FTP
transfer rates were about 5 MB/s with or without a switching hub. I tried a
direct crossover connection between the two cards to see the effect of the hub.
This was using kernel 2.2.16 with the driver as a module.

The tests so far were essentially in a single direction during any one
experiment, which was usually the FTP transfer of tar files of several GB.

I then explored how much additional bandwidth I could achieve using full
duplex, and so set up concurrent transfers in opposite directions. These were
both of tar files each of several GB. The transfers started and ran at a
few MB/sec in both directions for a substantial time (I think several minutes).
Transfers then stopped in both directions, and the drivers had to be restarted
before any more data could be transferred. This was done with a script that
invoked ifconfig eth0 up|down ... I did turn on debugging and examine the
logged messages, and everything seemed to point to a deadlocked driver, but I
had no time then to investigate further. I did however remove the hub and try
with the simple short crossover cable, and saw the same thing happen,
several times, with and without the hub. The stall is reproducible, but not
exactly the same each time: i.e. moderately nondeterministic.

2.2.17
======

Meanwhile 2.2.17 arrived and I installed that and recompiled the latest
tulip driver again against those headers (just in case anything had changed).
And I fetched ttcp so that I could stress test more thoroughly and eliminate
disc drivers and the like from the tests.

A few unidirectional tests showed transfer bandwidth of around 10MB/s, so it
was essentially saturating the network. Wonderful.

But then I went looking for the bug, and set up concurrent ttcp transfers
in both directions. At first things looked good, but as I suspected a deadlock,
I extended the size of the transfer and the length of the test. And indeed, it
fell over eventually. I am writing this on one of the machines with the
ttcp transfers stalled.

The situation:
--------------

   -----------------                      -----------------
   |  conquest2    |______________________|  conquest3    |
   |   (pentium)   |                      | (pentium II)  |
   -----------------                      -----------------

Both machines have a pair of xterms with one transmitting and one receiving
ttcp.

Here are the commands and results from the two windows on conquest3,
including the previous successful smaller transfers, which terminated correctly:

------- [conquest3 transmitting end] -----------------------

[root@conquest3 AL]# ttcp -t -n100000 conquest2 
ttcp-t: buflen=8192, nbuf=100000, align=16384/+0, port=5001  tcp  -> conquest2
ttcp-t: socket
ttcp-t: connect
ttcp-t: 819200000 bytes in 106.96 real seconds = 7479.39 KB/sec +++
ttcp-t: 100000 I/O calls, msec/call = 1.10, calls/sec = 934.92
ttcp-t: 0.2user 9.4sys 1:46real 9% 0i+0d 0maxrss 0+2pf 0+0csw

<<<success>>>

[root@conquest3 AL]# ttcp -t -n1000000 conquest2 
ttcp-t: buflen=8192, nbuf=1000000, align=16384/+0, port=5001  tcp  -> conquest2
ttcp-t: socket
ttcp-t: connect

<<<<deadlock here???? >>>>
-------------------------------------------------------------

-----------[ conquest3 receiving end ]-------------------------

[root@conquest3 AL]# ttcp -r conquest2 
ttcp-r: buflen=8192, nbuf=2048, align=16384/+0, port=5001  tcp
ttcp-r: socket
ttcp-r: accept from 192.168.0.2
ttcp-r: 819200000 bytes in 141.34 real seconds = 5659.99 KB/sec +++
ttcp-r: 599698 I/O calls, msec/call = 0.24, calls/sec = 4242.86
ttcp-r: 0.8user 10.2sys 2:21real 7% 0i+0d 0maxrss 0+2pf 0+0csw

<<<success>>>

[root@conquest3 AL]# ttcp -r conquest2 
ttcp-r: buflen=8192, nbuf=2048, align=16384/+0, port=5001  tcp
ttcp-r: socket
ttcp-r: accept from 192.168.0.2

<<< deadlock here ???>>>>>>>>>>>>
-------------------------------------------------------------------
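
As a cross-check on the two successful runs above, the KB/sec figures that
ttcp prints can be recomputed from the byte counts and elapsed times. This is
only a minimal sketch: the helper name ttcp_rate is made up for illustration,
and it assumes that ttcp's "KB" means 1024 bytes, which agrees with the
printed values to within the rounding of the elapsed time.

```python
def ttcp_rate(total_bytes, seconds):
    """Return (KB/sec, Mbit/sec); KB is taken as 1024 bytes (an assumption)."""
    kb_per_sec = total_bytes / 1024 / seconds
    mbit_per_sec = total_bytes * 8 / seconds / 1e6
    return kb_per_sec, mbit_per_sec


# Byte counts and elapsed times copied from the two successful runs above.
runs = [
    ("transmit to conquest2", 819200000, 106.96),
    ("receive from conquest2", 819200000, 141.34),
]

for label, nbytes, secs in runs:
    kb, mbit = ttcp_rate(nbytes, secs)
    print(f"{label}: {kb:.2f} KB/sec, {mbit:.1f} Mbit/sec")
```

On these numbers the two directions work out to about 61 and 46 Mbit/sec
respectively, each well below the nominal 100 Mbit/sec of the link.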

I can't cut and paste the similar windows from conquest2 just now because the
ethernet connection doesn't work :-)  Telnet just times out, as does anything
else that depends on the ethernet link.

------------------------------------------------------------------------

/etc/conf.modules contains:-
alias eth0 tulip
options tulip debug=1

Just now I don't have debug logging set up properly at this end of the link,
but there is nothing interesting in /var/log/messages. Just

--------[ extracts from /var/log/messages ]------------------------------

kernel: tulip.c:v0.92 4/17/2000  Written by Donald Becker <becker@scyld.com> 
kernel:   http://www.scyld.com/network/tulip.html 
kernel: eth0: Lite-On 82c168 PNIC rev 32 at 0xd08f6000,
    00:A0:CC:D0:1E:96, IRQ 5. 
kernel: eth0:  MII transceiver #1 config 3000 status 7829 advertising 01e1. 
kernel: eth0: Setting full-duplex based on MII #1 link partner capability of
    45e1. 
network: Bringing up interface eth0 succeeded 
__________________________________________________________________________

The card:-
00:0a.0 Ethernet controller: Lite-On Communications Inc LNE100TX (rev 20)
        Subsystem: Netgear FA310TX
        Flags: bus master, medium devsel, latency 64, IRQ 5
        I/O ports at e800
        Memory at eb000000 (32-bit, non-prefetchable)
        Expansion ROM at ea000000 [disabled]

---------------------------------------------------------------------------

--------[ Output from tulip-diag during deadlock(?) ]----------------

[root@conquest3 tulip]# ./tulip-diag 
tulip-diag.c:v2.03 7/31/2000 Donald Becker (becker@scyld.com)
 http://www.scyld.com/diag/index.html
Index #1: Found a Lite-On 82c168 PNIC adapter at 0xe800.
 Port selection is MII, full-duplex.
 Transmit started, Receive started, full-duplex.
  The Rx process state is 'Waiting for packets'.
  The Tx process state is 'Idle'.
  The transmit threshold is 256.
 Interrupt sources are pending!  CSR5 is 02670054.
   Tx out of buffers indication.
   Link passed indication.
   Rx Done indication.
 Use '-a' or '-aa' to show device registers,
     '-e' to show EEPROM contents, -ee for parsed contents,
  or '-m' or '-mm' to show MII management registers.
 ===============================================================

[root@conquest3 tulip]# ./tulip-diag -aa
tulip-diag.c:v2.03 7/31/2000 Donald Becker (becker@scyld.com)
 http://www.scyld.com/diag/index.html
Index #1: Found a Lite-On 82c168 PNIC adapter at 0xe800.
 * A potential Tulip chip has been found, but it appears to be active.
 * Either shutdown the network, or use the '-f' flag to see all values.
 Port selection is MII, full-duplex.
 Transmit started, Receive started, full-duplex.
  The Rx process state is 'Waiting for packets'.
  The Tx process state is 'Idle'.
  The transmit threshold is 256.
 Interrupt sources are pending!  CSR5 is 02670055.
   Tx done indication.
   Tx out of buffers indication.
   Link passed indication.
   Rx Done indication.
=================================================================
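
The two CSR5 values reported above (02670054 and 02670055) can be decoded by
hand to see which bits tulip-diag is describing. The following is a sketch
only: it assumes the status-bit layout of the DEC 21x4x family applies to the
PNIC, lists just the four bits that tulip-diag names in the output above, and
returns the Rx/Tx process-state fields as raw numbers rather than guessing
their names. The function and table names are illustrative, not taken from
the driver.

```python
# Low-order CSR5 status bits. The positions follow the DEC 21x4x layout,
# which is ASSUMED here to hold for the PNIC; the names are the ones
# tulip-diag printed above.
CSR5_BITS = {
    0: "Tx done",
    2: "Tx out of buffers",
    4: "Link passed",
    6: "Rx done",
}


def decode_csr5(value):
    """Return (list of set status flags, raw Rx state, raw Tx state)."""
    flags = [name for bit, name in sorted(CSR5_BITS.items())
             if value & (1 << bit)]
    rx_state = (value >> 17) & 0x7  # bits 17-19 (assumed layout)
    tx_state = (value >> 20) & 0x7  # bits 20-22 (assumed layout)
    return flags, rx_state, tx_state


for csr5 in (0x02670054, 0x02670055):
    flags, rx_state, tx_state = decode_csr5(csr5)
    print(f"CSR5 {csr5:08x}: {flags}, Rx state {rx_state}, Tx state {tx_state}")
```

Under that assumed layout the two readings differ only in bit 0 ("Tx done"),
and the state fields are identical in both.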
[root@conquest3 tulip]# ./tulip-diag -mm
tulip-diag.c:v2.03 7/31/2000 Donald Becker (becker@scyld.com)
 http://www.scyld.com/diag/index.html
Index #1: Found a Lite-On 82c168 PNIC adapter at 0xe800.
 Port selection is MII, full-duplex.
 Transmit started, Receive started, full-duplex.
  The Rx process state is 'Waiting for packets'.
  The Tx process state is 'Idle'.
  The transmit threshold is 256.
 Interrupt sources are pending!  CSR5 is 02670055.
   Tx done indication.
   Tx out of buffers indication.
   Link passed indication.
   Rx Done indication.
 MII PHY found at address 1, status 0x782d.
 MII PHY #1 transceiver registers:
   3000 782d 0040 6212 01e1 45e1 0003 0000
   0000 0000 0000 0000 0000 0000 0000 0000
   5000 0301 0000 0000 0000 0131 0100 0000
   003f f53e 0f00 ff00 002f 4000 80a0 000b.
=============================================================
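
The first row of the MII dump holds the standard registers: register 1 is the
status word (782d), register 4 the local advertisement (01e1) and register 5
the link partner's abilities (45e1). As a sketch under the usual IEEE 802.3
autonegotiation bit assignments (the names below are illustrative and not
taken from the driver), the negotiated mode and link state can be checked
like this:

```python
# Technology ability bits shared by MII registers 4 and 5, listed in
# autonegotiation priority order (highest first).
ABILITY_BITS = [
    (0x0100, "100baseTx-FD"),
    (0x0200, "100baseT4"),
    (0x0080, "100baseTx-HD"),
    (0x0040, "10baseT-FD"),
    (0x0020, "10baseT-HD"),
]

BMSR_LINK_UP = 0x0004           # status register (reg 1): link status
BMSR_AUTONEG_COMPLETE = 0x0020  # status register (reg 1): autoneg complete


def negotiated_mode(advertised, partner):
    """Return the highest-priority mode both ends offer, or None."""
    common = advertised & partner
    for mask, name in ABILITY_BITS:
        if common & mask:
            return name
    return None


def link_up(status):
    """True if the link-status bit is set in the MII status register."""
    return bool(status & BMSR_LINK_UP)


print(negotiated_mode(0x01E1, 0x45E1))
print(link_up(0x782D), bool(0x782D & BMSR_AUTONEG_COMPLETE))
```

This agrees with tulip-diag's "MII, full-duplex" report and with the link
still being reported as up during the stall.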

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The deadlock was at "this" (conquest3) end. Resetting the driver
restored communication with conquest2. The conquest2 driver did not have to
be reset. I am pretty sure that in the 2.2.16 tests the conquest2 driver
was the one that stalled. In this context, I am very confident in the
integrity of the hardware on both machines, but I might be wrong.

That the stall was a deadlock rather than a livelock is suggested by:

1) the inactivity of the link LEDs, confirmed by the tulip-diag status;
2) the fact that neither CPU had any significant activity.

These in turn rule out a starvation pathology.

I will post this without investigating further for now. I guess this is mainly
for Donald, but maybe others have encountered similar problems? This does look
like a synchronisation failure somewhere, and it does also look like a problem
with the kernel or with the driver. Or, more likely, with insufficiently
specified behaviour of the synchronisation primitives? But I am only guessing.
I hope that I am wrong.

One note. I did erase a large file on conquest3 while the transfer was in
progress. I did that partly to get more concurrent activity going, with a 
better chance of exposing any synchronisation problem, partly because I was
bored waiting for the test to complete, and also because the file needed
erasing. :-) Of course, the bug could be in one of the other drivers
that I invoked.

What tests should I carry out now? What further information would be most
useful in tracking this down? Can other people reproduce the problem? They may
have it but have never stressed the driver far enough.

Over to you all, especially to Donald, who has written a magnificent set of
drivers for us all.

Adrian
