link negotiation with switch, Netgear crashes, and more

Mike E. Ciholas mikec@lcs.mit.edu
Tue Jul 13 23:55:01 1999


Donald,

You may remember me as a partner in FlowNet with Erann Gat; I
delivered FN boards to you by flying into College Park airport.  I can
fill you in on FlowNet status if you want, but I need your help in the
area of Linux drivers.

I recently acquired a 10/100 switch and thought all I had to do was
take out my 10 hub and swap it in.  I already had 10/100 NICs in most
computers.  Easy! I thought.

Well, it was not to be.  My SMC9332DST cards can't make any sort of
connection with the Linksys EZXS88W.  I then bought other NICs (a
Linksys LNE100TX V2.0 and a Netgear FA310TX).  I discovered some
strange behaviour.  Here are my tests:

Test setup:

Unless otherwise mentioned, there are three computers and a switch
involved:

Piano is a P133 desktop machine with an Intel Atlantis motherboard.
Piano runs RedHat 5.2.

Harp is a PII-400 desktop machine with a BH6 (?) motherboard.  Harp
runs RedHat 6.0 and Windows 98.

Harmonica is a Hitachi VisionBook Pro 7775 laptop, PII-300, with a
built in 21143 Ethernet port.  Harmonica runs RedHat 6.0 and Windows
98.

The switch is a Linksys EZXS88W with 8 ports.  It has lights for LINK,
FD, 100, and COL for each port.  Connected to the switch is an 8 port
10baseT hub which is connected to some 10 only devices (ISDN router,
Appletalk router).

Test results:

My notes are followed by output from /var/log/messages upon driver
loading, output from tulip-diag, and output from ifconfig eth0.

The machines get their IP address from DHCP (and then their names from
reverse DNS), so don't be confused when they get strange identities (I
did not edit dhcpd.conf for every NIC change).

Test 1:

Harp with SMC9332DST (purchased 2-3 years ago).

This setup worked fine with my 10baseT hub for several years.

The card fails to initiate a connection and the network is not brought
up.  During negotiation, the 100 and FD lights do not come on.  The COL
light comes on briefly, coincident with relay switching (I suspect 100
mode sets off the COL light).  After 4 relay clicks, the driver gives
up.  The switch is left with only LINK lit.  After the OS boots
(without network), the status is:

Jul 12 22:29:36 localhost kernel: tulip.c:v0.91e 5/27/99 becker@cesdis.gsfc.nasa.gov
Jul 12 22:29:36 localhost kernel: eth0: Digital DS21140 Tulip rev 18 at 0xe400, 00:00:C0:A5:F4:D0, IRQ 10.
Jul 12 22:29:36 localhost kernel: eth0: Old format EEPROM on 'SMC9332DST' board.  Using substitute media control info.
Jul 12 22:29:36 localhost kernel: eth0:  EEPROM default media type Autosense.
Jul 12 22:29:36 localhost kernel: eth0:  Index #0 - Media 10baseT (#0) described by a 21140 non-MII (0) block.
Jul 12 22:29:36 localhost kernel: eth0:  Index #1 - Media 10baseT-FD (#4) described by a 21140 non-MII (0) block.
Jul 12 22:29:36 localhost kernel: eth0:  Index #2 - Media 100baseTx (#3) described by a 21140 non-MII (0) block.
Jul 12 22:29:36 localhost kernel: eth0:  Index #3 - Media 100baseTx-FD (#5) described by a 21140 non-MII (0) block.

tulip-diag.c:v1.10 4/12/99 Donald Becker (becker@cesdis.gsfc.nasa.gov)
Index #1: Found a Digital DS21140 Tulip adapter at 0xe400.
 Port selection is 10mpbs-serial, half-duplex.
 Transmit stopped, Receive stopped, half-duplex.
  The Rx process state is 'Stopped'.
  The Tx process state is 'Stopped'.
  The transmit threshold is 72.
 Use '-a' to show device registers,
     '-e' to show EEPROM contents,
  or '-m' to show MII management registers.

eth0      Link encap:Ethernet  HWaddr 00:00:C0:A5:F4:D0  
          BROADCAST DEBUG  MTU:1500  Metric:1
          RX packets:17 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4 errors:4 dropped:0 overruns:0 carrier:8
          collisions:0 txqueuelen:100 
          Interrupt:10 Base address:0xe400

Test 2:

Replace harp's NIC with a Linksys LNE100TX (version 2.0).

Jul 12 22:46:23 cello kernel: tulip.c:v0.91e 5/27/99 becker@cesdis.gsfc.nasa.gov
Jul 12 22:46:23 cello kernel: eth0: Lite-On PNIC-II rev 37 at 0xe400, 00:A0:CC:32:11:47, IRQ 10.

tulip-diag.c:v1.10 4/12/99 Donald Becker (becker@cesdis.gsfc.nasa.gov)
Index #1: Found a Lite-On PNIC-II adapter at 0xe400.
 Port selection is 100mbps-SYM/PCS 100baseTx scrambler, full-duplex.
 Transmit started, Receive started, full-duplex.
  The Rx process state is 'Waiting for packets'.
  The Tx process state is 'Idle'.
  The transmit threshold is 128.
 Use '-a' to show device registers,
     '-e' to show EEPROM contents,
  or '-m' to show MII management registers.

eth0      Link encap:Ethernet  HWaddr 00:A0:CC:32:11:47  
          inet addr:192.168.1.201  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING  MTU:1500  Metric:1
          RX packets:214 errors:0 dropped:0 overruns:0 frame:0
          TX packets:239 errors:3 dropped:0 overruns:0 carrier:4
          collisions:0 txqueuelen:100 
          Interrupt:10 Base address:0xe400 

Things seem to work fine.  An FTP get from piano to harp runs at 3.1
MB/s, and a put also runs at 3.1 MB/s.  The output from ifconfig is
nominal.

Test 3:

Test harmonica.  Since it is a laptop, I cannot replace the Ethernet
card, so I have to make it work.  The chip is a 21143.

Upon booting, everything seems fine, but there is trouble lurking.
The switch has the LINK, FD, and 100 lights lit.  They stay lit during
the entire booting process.

Jul 12 23:08:01 harmonica kernel: tulip.c:v0.91e 5/27/99 becker@cesdis.gsfc.nasa.gov
Jul 12 23:08:01 harmonica kernel: eth0: Digital DS21143 Tulip rev 65 at 0xfc00, 00:80:C8:49:75:05, IRQ 11.
Jul 12 23:08:01 harmonica kernel: eth0:  EEPROM default media type Autosense.
Jul 12 23:08:01 harmonica kernel: eth0:  Index #0 - Media MII (#11) described by a 21142 MII PHY (3) block.
Jul 12 23:08:01 harmonica kernel: eth0:  MII transceiver #17 config 1000 status 782d advertising 01e1.

tulip-diag.c:v1.10 4/12/99 Donald Becker (becker@cesdis.gsfc.nasa.gov)
Index #1: Found a Digital DS21143 Tulip adapter at 0xfc00.
 Port selection is MII, half-duplex.
 Transmit started, Receive started, half-duplex.
  The Rx process state is 'Waiting for packets'.
  The Tx process state is 'Idle'.
  The transmit threshold is 128.
 MII PHY found at address 17, status 0x782d.
 MII PHY #17 transceiver registers:
   1000 782d 7810 0001 0001 45e1 0001 0000
   0000 0000 0000 0000 0000 0000 0000 0000
   0000 0000 4000 0000 3ffb 0010 0000 0002
   0001 0000 0000 0000 0000 0000 0000 0000.
  Internal autonegotiation state is 'Autonegotiation disabled'.

eth0      Link encap:Ethernet  HWaddr 00:80:C8:49:75:05  
          inet addr:192.168.1.14  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING  MTU:1500  Metric:1
          RX packets:65 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:69 dropped:0 overruns:0 carrier:69
          collisions:0 txqueuelen:100 
          Interrupt:11 Base address:0xfc00

Note that agreement on FD was *NOT* reached, and that ifconfig reports
every transmit as a carrier error (TX packets:0, errors:69,
carrier:69).  However, the network appears to work.
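The MII register dumps can be read by hand to see what each side
offered.  The following is only a sketch: it assumes both PHYs use the
standard MII layout (register 1 = status, register 4 = our
advertisement, register 5 = link partner ability) and the usual
priority order for picking the common mode.  The register values are
copied from the tulip-diag dumps in this message (Test 3 for harmonica,
Test 4 for the Netgear card); the function names are made up for
illustration.

```python
# Sketch: decode MII autonegotiation registers, assuming the standard
# IEEE 802.3 bit layout applies to these particular PHYs.

# (mask, name) pairs, highest negotiation priority first.
ABILITY_BITS = [
    (0x0100, "100baseTx-FD"),
    (0x0200, "100baseT4"),
    (0x0080, "100baseTx"),
    (0x0040, "10baseT-FD"),
    (0x0020, "10baseT"),
]


def abilities(reg):
    """List the media types flagged in an advertisement/partner register."""
    return [name for mask, name in ABILITY_BITS if reg & mask]


def negotiated(advert, partner):
    """Return the best mode both sides offer, or None if there is none."""
    common = advert & partner
    for mask, name in ABILITY_BITS:
        if common & mask:
            return name
    return None


def decode_status(bmsr):
    """Pick out the link and autonegotiation bits of MII register 1."""
    return {
        "link_up": bool(bmsr & 0x0004),
        "autoneg_capable": bool(bmsr & 0x0008),
        "autoneg_complete": bool(bmsr & 0x0020),
    }


# (register 4, register 5) as shown in the two tulip-diag dumps.
dumps = {
    "harmonica 21143, PHY #17": (0x0001, 0x45E1),
    "harp FA310TX, PHY #1": (0x01E1, 0x05E1),
}

for label, (advert, partner) in dumps.items():
    print(label)
    print("  we advertise: ", abilities(advert))
    print("  partner offers:", abilities(partner))
    print("  common best:   ", negotiated(advert, partner))

print(decode_status(0x782D))
```

Read this way, the Netgear PHY and the switch share 100baseTx-FD, while
register 4 in the harmonica dump (0001) carries no media bits at all,
even though the driver's load-time line says "advertising 01e1".  If
that dump is accurate, it would be consistent with the failure to agree
on full duplex, but this is an inference from the dump, not a confirmed
diagnosis.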

It is only during heavy loads that the problems become apparent.  An
ftp get from piano to harmonica results in 1.1 MB/s with ifconfig
reporting an enormous number of carrier TX errors (over 25,000 for a
65 MB file).  A few TX packets are listed as sent without errors.  An
ftp put results in 2.1 MB/s, with nearly 50,000 more TX carrier
errors.
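Counting these errors by eye from ifconfig gets tedious across many
runs.  Below is a small sketch for pulling the TX counters out of an
ifconfig listing and comparing two snapshots taken before and after a
transfer.  It assumes the "TX packets:" line has exactly the format
shown in the listings in this message; the helper names are
hypothetical, and the sample line is the one from the Test 3 listing.

```python
import re

# Matches the "TX packets:" line as ifconfig prints it in these listings.
TX_LINE = re.compile(
    r"TX packets:(\d+) errors:(\d+) dropped:(\d+) overruns:(\d+) carrier:(\d+)"
)
TX_FIELDS = ("packets", "errors", "dropped", "overruns", "carrier")


def tx_counters(ifconfig_text):
    """Return the TX counters from one ifconfig listing as a dict of ints."""
    match = TX_LINE.search(ifconfig_text)
    if match is None:
        raise ValueError("no 'TX packets:' line found")
    return dict(zip(TX_FIELDS, map(int, match.groups())))


def tx_delta(before, after):
    """Counter increase between two snapshots (e.g. around one ftp run)."""
    return {name: after[name] - before[name] for name in TX_FIELDS}


def carrier_ratio(counters):
    """Carrier errors divided by TX errors (0.0 when there are no errors)."""
    if counters["errors"] == 0:
        return 0.0
    return counters["carrier"] / counters["errors"]


# TX line copied from the harmonica listing in Test 3.
sample = "          TX packets:0 errors:69 dropped:0 overruns:0 carrier:69"
counters = tx_counters(sample)
print(counters)
print("carrier/errors:", carrier_ratio(counters))
```

Taking one snapshot before a transfer and one after, then calling
tx_delta on the pair, gives the error count attributable to that
transfer alone rather than the running total since boot.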

If I run the tests with harp (LNE100TX V2.0 installed), the results
are even worse.  The faster computer and the lack of agreement on
duplex must be causing harp to think it sees collisions when the
switch means to send it FD packets.  An ftp get results in a dismal
0.28 MB/s while a put gives me 1.5 MB/s.  Clearly the faster computer
causes bad timing on the wire somehow.

I've tried "options=30" to force full duplex and MII 100baseTX, but
that doesn't work either.  I think I have eventually tried all numbers
between 0 and 31.

I also tried driver version 0.89L (so called "pcmcia version" that you
directed someone else to use on an earlier Hitachi laptop).  This did
not work either.

Now for the kicker: I booted harmonica into Windows98 and using the
driver supplied by Hitachi on their web site, I was able to duplicate
the problem.  That is, the performance was dismal and asymmetric.  I do
not have any diagnostic tools to tell me what the card thought it was
doing but the results seem conclusive that the Windows driver is no
better than yours.

The Windows driver on an ftp get (harp to harmonica) got 0.17 MB/s.
On an ftp put (harmonica to harp) the throughput was much higher,
about 2-3 MB/s.  This performance was true even if 100baseTX-FD was
explicitly selected as the media type.

Test 4:

Harp has a Netgear FA310TX card installed.

Harp reports:

Jul 12 23:35:46 harp kernel: tulip.c:v0.91e 5/27/99 becker@cesdis.gsfc.nasa.gov
Jul 12 23:35:46 harp kernel: eth0: Lite-On 82c168 PNIC rev 33 at 0xe400, 00:A0:CC:3E:AB:A1, IRQ 10.
Jul 12 23:35:46 harp kernel: eth0:  MII transceiver #1 config 1000 status 782d advertising 01e1.

tulip-diag.c:v1.10 4/12/99 Donald Becker (becker@cesdis.gsfc.nasa.gov)
Index #1: Found a Lite-On 82c168 PNIC adapter at 0xe400.
 Port selection is MII, full-duplex.
 Transmit started, Receive started, full-duplex.
  The Rx process state is 'Waiting for packets'.
  The Tx process state is 'Idle'.
  The transmit threshold is 128.
 MII PHY found at address 1, status 0x782d.
 MII PHY #1 transceiver registers:
   1000 782d 7810 0000 01e1 05e1 0003 0000
   0000 0000 0000 0000 0000 0000 0000 0000
   0000 0000 4000 0000 3ffb 0010 0000 0002
   0001 0000 0000 0000 0000 0000 0000 0000.

eth0      Link encap:Ethernet  HWaddr 00:A0:CC:3E:AB:A1  
          inet addr:192.168.1.12  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING  MTU:1500  Metric:1
          RX packets:194 errors:0 dropped:0 overruns:0 frame:0
          TX packets:215 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          Interrupt:10 Base address:0xe400 

So everything is cool.  Now I try an ftp from harmonica to harp
(harmonica still suffers from duplex confusion).  The performance is
dismal as before, 0.27 MB/s, and the TX carrier errors are present.
However, on the ftp put (data from harmonica to harp), a strange thing
happens.  Harp loses the ability to network (stopping harmonica's
transfer dead in its tracks).  The state of the world is:

tulip-diag.c:v1.10 4/12/99 Donald Becker (becker@cesdis.gsfc.nasa.gov)
Index #1: Found a Lite-On 82c168 PNIC adapter at 0xe400.
 Port selection is MII, full-duplex.
 Transmit started, Receive started, full-duplex.
  The Rx process state is 'Suspended -- no Rx buffers'.
  The Tx process state is 'Idle'.
  The transmit threshold is 128.
 Interrupt sources are pending!  CSR5 is 026880d7.
   Tx done indication.
   Tx complete indication.
   Tx out of buffers indication.
   Link passed indication.
   Rx Done indication.
   Receiver out of buffers indication.
 Use '-a' to show device registers,
     '-e' to show EEPROM contents,
  or '-m' to show MII management registers.

eth0      Link encap:Ethernet  HWaddr 00:A0:CC:3E:AB:A1  
          inet addr:192.168.1.12  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING  MTU:1500  Metric:1
          RX packets:41352 errors:0 dropped:0 overruns:0 frame:0
          TX packets:65284 errors:11 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          Interrupt:10 Base address:0xe400 

And there the world ends as far as network is concerned.  This
identical test was rock solid with the Linksys LNE100TX V2.0 NIC, but
the Netgear design (with an external PHY chip) should be older and one
would think more maturely supported in the driver.  But I've clearly
exposed some sort of race condition that causes interrupts to be
pending but unserviced.  My test file (65 MB) gets about 3-4 MB into
the transfer before crashing; it has never completed successfully.
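For reference, the CSR5 value from the hung state can be picked apart
bit by bit.  This is a sketch under assumptions: the event names are
the six indications tulip-diag printed for CSR5 = 026880d7, matched in
order to the six low bits that are set, and the Rx/Tx state fields are
taken to sit in bits 17-19 and 20-22 as on the DEC 21x4x chips.  The
PNIC may not follow that layout exactly, so treat the mapping as
inferred rather than documented.

```python
# Sketch: decode the CSR5 status word reported by tulip-diag.
# Assumption: bit positions follow the DEC 21x4x layout; the names are
# the ones tulip-diag printed for this value.

CSR5_EVENTS = [
    (0x0001, "Tx done"),
    (0x0002, "Tx complete"),
    (0x0004, "Tx out of buffers"),
    (0x0010, "Link passed"),
    (0x0040, "Rx Done"),
    (0x0080, "Receiver out of buffers"),
]


def decode_csr5(csr5):
    """Split CSR5 into pending event names and Rx/Tx process state codes."""
    events = [name for mask, name in CSR5_EVENTS if csr5 & mask]
    rx_state = (csr5 >> 17) & 0x7  # receive process state field
    tx_state = (csr5 >> 20) & 0x7  # transmit process state field
    return events, rx_state, tx_state


events, rx_state, tx_state = decode_csr5(0x026880D7)
for name in events:
    print(name, "indication")
print("Rx state code:", rx_state)  # shown as 'Suspended -- no Rx buffers'
print("Tx state code:", tx_state)  # shown as 'Idle'
```

The point of interest is that the Rx-done and receiver-out-of-buffers
bits are both still set while the receive process sits suspended, which
fits the picture of interrupts pending but never serviced.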

When harmonica is in Windows, harp never crashes.  This may simply be
due to the lower throughput of Windows (about 1.7 MB/s), so there is
either less chance to hit the race condition or the fault is load
dependent.

Conclusion
==========

Do I buy another switch?  That might explain the SMC9332DST and Hitachi
VBPro 21143 problems with negotiation.

Do I buy different NICs?  The Netgear seems to have some problem under
high load, and the Linksys LNE100TX V2.0 seems solid, but I can't
replace the laptop chip.

Do I hope for your insight and clever programming to find the fault?
That would be the best option...

I know your time is precious, so I've tried to be as thorough as
possible in my testing.  What can I do that would help now?

Mike Ciholas                            (812) 858-1355 voice
CIHOLAS Enterprises                     (812) 858-1360 fax
5855 Fiesta Drive                       mikec@flownet.com
Newburgh, IN 47630                      mikec@ciholas.com