[vortex] 3c905-errors

Steven Timm timm@fnal.gov
Mon Oct 6 18:18:01 2003


Steve Timm responding to Donald Becker:
On Mon, 6 Oct 2003, Donald Becker wrote:

> On Mon, 6 Oct 2003, Steven Timm wrote:
>
> > We have had the following hardware for about a year now
> > and continue to have mysterious and intermittent network problems on it.
>
> > Configuration: Tyan 2466 motherboard, built-in NIC: 3C905TX, dual
> > AMD MP2000+ processors, 760MPX chipset. 1 GB RAM.  Typical mode of
> > usage is fast burst transfer of large files, 1-2 GB.  Both NIC
> > and switch are configured to auto-negotiate.
>
> > 01:05.0 VGA compatible controller: ATI Technologies Inc Rage Mobility P/M AGP 2x (rev 64)
>
> Are you running the video controller in graphics mode or text mode?
> (Yes, it _might_ make a difference.  Some video controllers are notorious
> for hogging bus bandwidth and violating bus-hold-time specs.)

I don't know.  No X server is running on the machine, if that is
what you are asking.

>
> > 02:08.0 Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado] (rev 78)
>
> Important detail: no SCSI controllers.
>
> > dmesg shows:
> > 02:08.0: 3Com PCI 3c905C Tornado at 0x3000. Vers LK1.1.18-ac
> >  00:e0:81:23:66:5b, IRQ 19
> >   product code 0000 rev 00.6 date 00-00-00
>
> Ehhhh, this makes me suspicious that other vital parameters from the
> EEPROM are bogus.  Specifically the settings that tell the BIOS the PCI
> burst bandwidth requirements.
>
It's this way on all our boards.  See below.

>
> > Two:
> > Oct  5 23:53:33 fnd0139 kernel: eth0: Setting half-duplex based on MII #24 link partner capability of 0000.
> > Oct  5 23:54:33 fnd0139 kernel: eth0: Setting full-duplex based on MII #24 link partner capability of 41e1.
> >
> > These errors are associated with increases in the count of
> > transmit carrier errors in /proc/net/dev.  They do, for about a minute,
> > lead to loss of network connectivity with the node.
>
> Yup, they would.  They seem to be reporting a loss of link beat.
>
> > necessarily restarted and it does cause trouble.  The link light
> > of the NIC goes out during this time.
>
> That's a vital detail.
> The link LEDs going out strongly indicates that it's not the driver
> having a problem reading the status register, but rather the switch
> dropping the link.
>
> This can be verified by running 'mii-diag' while the link is down.

Since the link is down for only about a minute and our reporting
process has a time lag, it is difficult to catch a node while the link
is down (and even more difficult to log into it during that time).
We still need to verify the link-light observation above.
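
One way around the reporting lag would be to watch the transmit
carrier counter locally on each node.  What follows is only a minimal
sketch, not something we run today.  The function names, the
5-second interval and the default "eth0" are assumptions for
illustration; the only thing taken as given is the /proc/net/dev
layout of 8 receive counters followed by 8 transmit counters, with
carrier as the 7th transmit counter.

```python
import time

def tx_carrier_errors(proc_net_dev_text):
    """Return {interface: transmit carrier error count} from /proc/net/dev text."""
    counts = {}
    for line in proc_net_dev_text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, stats = line.split(":", 1)
        fields = stats.split()
        if len(fields) < 16:
            continue  # not a full statistics line
        # Fields 0-7 are receive counters, 8-15 transmit; carrier is field 14.
        counts[name.strip()] = int(fields[14])
    return counts

def watch(interface="eth0", interval=5.0, path="/proc/net/dev"):
    """Print a timestamped line each time the carrier counter increases."""
    last = None
    while True:
        with open(path) as f:
            now = tx_carrier_errors(f.read()).get(interface)
        if last is not None and now is not None and now > last:
            stamp = time.strftime("%b %d %H:%M:%S")
            print(stamp, interface, "tx carrier errors", last, "->", now)
        last = now
        time.sleep(interval)
```

Calling watch("eth0") in the background on a node would timestamp
each increase locally, so the times can be compared with the duplex
messages in syslog without having to log in while the link is down.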


>
> >  We have seen this problem
> > happen on all of 240 different nodes in a cluster,
>
> Simultaneously, or a random pattern?
>
It's a quasi-random pattern, and happens even when all the nodes
are idle.  Typically on a few nodes at once.



> > The problem
> > happens with all of them.  Interestingly enough, the switch does
> > not record any errors in its error counters during these episodes.
>
> Hmmm, it definitely should record a re-negotiation.

I need to verify the link light information above.

>
> > may be causing this problem. In particular, is there any
> > record that Tyan may have misconfigured this NIC when they put it
> > on their board?
>
> The bogus EEPROM sections hint that they didn't take any special care
> with the implementation.
>
> > Errors of this type:
> > nfs: server d0bbin-farm not responding, still trying
> > eth0: transmit timed out, tx_status 00 status e601.
> >   diagnostics: net 0ccc media 8880 dma 0000003a.
> > eth0: Interrupt posted but not delivered -- IRQ blocked by another device?
>
> This is a very different case.
> It is likely a problem with the kernel's IRQ handling.
> The device driver is reporting that the hardware raised an interrupt,
> yet the interrupt handler was never called.
>
> > On the machine that gave this error:
> > [root@fnd0228 bin]# ./mii-diag
>
> This isn't a transceiver problem.  Check /proc/interrupts to see if the
> IRQ count increases.
>
> A work-around for the underlying bug might be to pass 'noapic' as a
> kernel option.
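
The /proc/interrupts check suggested above could be scripted roughly
as follows.  This is a sketch under assumptions: the function name is
made up for illustration, and it assumes the usual layout of a header
line naming the CPUs, then one line per IRQ with a per-CPU count
followed by the controller type and the device names.

```python
def irq_count(proc_interrupts_text, device):
    """Return the total interrupt count (summed over CPUs) for a device,
    or None if no line in /proc/interrupts names that device."""
    lines = proc_interrupts_text.splitlines()
    if not lines:
        return None
    ncpu = len(lines[0].split())  # header line: CPU0 CPU1 ...
    for line in lines[1:]:
        if ":" not in line:
            continue
        _, rest = line.split(":", 1)
        fields = rest.split()
        # Everything after the per-CPU counts is controller type and
        # device names; shared IRQs list names separated by commas.
        names = [tok.rstrip(",") for tok in fields[ncpu:]]
        if device in names:
            return sum(int(c) for c in fields[:ncpu])
    return None
```

Reading /proc/interrupts twice a few seconds apart while traffic is
flowing and comparing irq_count(text, "eth0") between the two reads
would show whether the count is advancing.  A count that stays flat
during a "transmit timed out" episode would be consistent with the
interrupt never reaching the handler.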
>
> > [root@fnd0228 bin]# ./vortex-diag -a
> > vortex-diag.c:v2.14 12/28/2002 Donald Becker (becker@scyld.com)
> ...
> >    Tx FIFO thresholds: min. burst 256 bytes, priority with 128 bytes to empty.
> >    Rx FIFO thresholds: min. burst 256 bytes, priority with 128 bytes to full.
> ...
> >    Maximum burst recorded Tx 0,  Rx 352.
>
> Ehhh, that's a rather short burst.  The chip is capable of transferring
> a whole 1500 byte packet to memory in a single burst.  On PCI 2.1, the
> bus parameters should be set up to allow this.
>
> > [root@fnd0228 bin]# ./vortex-diag -e
>
> What does '-eee' (show all details) report?
>
> > Saved EEPROM settings of a 3Com Vortex/Boomerang:
> >  3Com Node Address 00:E0:81:23:66:5B (used as a unique ID only).
> >  OEM Station address 00:E0:81:23:66:5B (used as the ethernet address).
> >   Device ID 9200,  Manufacturer ID 6d50.
> >   Manufacture date (MM/DD/YYYY) 0/0/2000, division , product .
> >   No BIOS ROM is present.
> >  Transceiver selection: Autonegotiate.
> >    Options: negotiated duplex, link beat required.
> >  PCI Subsystem IDs: Vendor 10f1 Device 2466.
> >  100baseTx 10baseT.
> >   Vortex format checksum is incorrect (29 vs. 10f1).
> >   Cyclone format checksum is incorrect (0xb5 vs. 0xff).
> >   Hurricane format checksum is incorrect (0x68 vs. 0xff).
>
> They didn't get the checksum right.

Here's output of vortex-diag -eee.


[root@fnd0228 D0SB]# /usr/local/bin/vortex-diag -eee
vortex-diag.c:v2.14 12/28/2002 Donald Becker (becker@scyld.com)
 http://www.scyld.com/diag/index.html
Index #1: Found a 3c905C Tornado 100baseTx adapter at 0x3000.
 Station address 00:e0:81:23:66:5b.
  Receive mode is 0x07: Normal unicast and all multicast.
EEPROM format 64x16, configuration table at offset 0:
    00: 00e0 8123 665b 9200 0000 0000 0000 6d50  __#_[f________Pm
  0x08: 2940 0000 00e0 8123 665b 0010 0000 00aa  @)____#_[f______
  0x10: 72a2 0000 0000 0180 0000 0004 1421 10f1  _r__________!___
  0x18: 2466 000a 0002 6300 ff43 4343 ffff ffff  f$_____cC_CC____
  0x20: ffff ffff ffff ffff ffff ffff ffff ffff  ________________
      ...

 The word-wide EEPROM checksum is 0x5d2c.
Saved EEPROM settings of a 3Com Vortex/Boomerang:
 3Com Node Address 00:E0:81:23:66:5B (used as a unique ID only).
 OEM Station address 00:E0:81:23:66:5B (used as the ethernet address).
  Device ID 9200,  Manufacturer ID 6d50.
  Manufacture date (MM/DD/YYYY) 0/0/2000, division , product .
  No BIOS ROM is present.
 Transceiver selection: Autonegotiate.
   Options: negotiated duplex, link beat required.
 PCI Subsystem IDs: Vendor 10f1 Device 2466.
 100baseTx 10baseT.
  Vortex format checksum is incorrect (29 vs. 10f1).
  Cyclone format checksum is incorrect (0xb5 vs. 0xff).
  Hurricane format checksum is incorrect (0x68 vs. 0xff).
[root@fnd0228 D0SB]#
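
To make the raw dump above easier to compare across boards, here is a
small sketch that pulls a few fields out of the first four rows.  The
word offsets are not taken from 3Com documentation; they are inferred
by matching the hex words against the decoded lines that vortex-diag
prints, so treat them as assumptions.  The function names are
likewise just for illustration.

```python
# First four rows of the 'vortex-diag -eee' dump shown above.
DUMP = """\
    00: 00e0 8123 665b 9200 0000 0000 0000 6d50  __#_[f________Pm
  0x08: 2940 0000 00e0 8123 665b 0010 0000 00aa  @)____#_[f______
  0x10: 72a2 0000 0000 0180 0000 0004 1421 10f1  _r__________!___
  0x18: 2466 000a 0002 6300 ff43 4343 ffff ffff  f$_____cC_CC____
"""

def parse_words(dump):
    """Return the 16-bit EEPROM words from the dump, in order."""
    words = []
    for line in dump.splitlines():
        if ":" not in line:
            continue
        _, rest = line.split(":", 1)
        # Each row holds 8 hex words; the trailing ASCII column is ignored.
        words.extend(int(tok, 16) for tok in rest.split()[:8])
    return words

def mac(three_words):
    """Format three 16-bit words as a colon-separated station address."""
    return ":".join("%02x:%02x" % (w >> 8, w & 0xFF) for w in three_words)

words = parse_words(DUMP)
print("node address    ", mac(words[0x00:0x03]))
print("OEM address     ", mac(words[0x0A:0x0D]))
print("device ID        %04x" % words[0x03])
print("manufacturer ID  %04x" % words[0x07])
print("subsystem vendor %04x" % words[0x17])
print("subsystem device %04x" % words[0x18])
```

One thing this makes visible: the value the Vortex-format check
compares against (10f1) is the same word, 0x17, that decodes as the
PCI subsystem vendor ID.  So that particular "incorrect" line may
only mean the EEPROM is not laid out in the older Vortex format,
rather than that the word is corrupt.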

So is there anything that can be done to get the checksums
right and increase the burst setting?  Should we take this up
with Tyan tech support, or is there anything in vortex-diag or
its associated utilities that would allow us to reflash these
EEPROMs from Linux?

Steve Timm




>
> > to operate without errors.  Replacing the motherboard also fixes
> > this problem.  Is there something short of either of these
> > two solutions which could be done to stop this from happening?
> ..
> > We have actually thought about buying different network cards
> > to put into these machines, but are reluctant to do so because
> > we believe this would interfere with being able to PXE-boot
> > these machines.
>
> If you decide to do this, buy network adapters with their own PXE boot
> ROMs.  The motherboard BIOS PXE ROM will not work, even with other
> 3c905c cards.
>
> --
> Donald Becker				becker@scyld.com
> Scyld Computing Corporation		http://www.scyld.com
> 914 Bay Ridge Road, Suite 220		Scyld Beowulf cluster system
> Annapolis MD 21403			410-990-9993