[tulip] tulip based cluster on Cisco switch

Homer Wilson Smith homer@lightlink.com
Sat, 10 Jun 2000 11:42:44 -0400 (EDT)


    Also verify whether you are using 21143 or 21140 chips.

    The 21140s are well behaved as far as I can tell; the 21143s
are hopeless.
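
    One quick way to check which chip you have (assuming pciutils is
installed; /proc/pci on a 2.2 kernel shows the same thing, and
tulip-diag from Donald Becker's site names the chip in its first line
of output):

        lspci | grep -i ethernet
        grep -i ethernet /proc/pci
        tulip-diag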

    Homer

------------------------------------------------------------------------
Homer Wilson Smith   Clear Air, Clear Water,  Art Matrix - Lightlink
(607) 277-0959       A Green Earth and Peace. Internet Access, Ithaca NY
homer@lightlink.com  Is that too much to ask? http://www.lightlink.com

On Sat, 10 Jun 2000, Brian D. Haymore wrote:

> Thanks for the tip.  We have been quite attentive to our NICs and their
> link state.  But with two years of life on this cluster we have yet to
> see very many really show-stopping problems.  You might want to verify
> that you have the latest Cisco software in your switch.
> 
> --
> Brian D. Haymore
> University of Utah
> Center for High Performance Computing
> 155 South 1452 East RM 405
> Salt Lake City, Ut 84112-0190
> 
> Email: brian@chpc.utah.edu - Phone: (801) 585-1755 - Fax: (801) 585-5366
> 
> On Sat, 10 Jun 2000, David Thompson wrote:
> 
> > 
> > I'll fess up here; we're Michael's customer in this case.  We are also seeing 
> > the changed MAC address with the .92 driver (vendor code 00:40:f0 instead of 
> > 00:c0:f0).  We put the new MAC address in our DHCP server but still couldn't 
> > get a lease with the new driver.  The server is sending out replies, but the 
> > client doesn't seem to be getting them.  Caveat for anyone with DHCP and tulip 0.92...
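> > 
> > For anyone comparing notes, the sort of host entry we mean is along these 
> > lines (ISC dhcpd syntax; the node name, MAC, and address below are just 
> > placeholders):
> > 
> >     host node01 {
> >       hardware ethernet 00:40:f0:12:34:56;  # the MAC the .92 driver reports
> >       fixed-address 192.168.1.101;
> >     }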
> > 
> > Our sole reason for trying the .92 driver was to see if we could use the 
> > 'options=' goo to force the whole cluster to 100BaseTX/fdx.  We have not been 
> > able to get these cards to auto-negotiate properly and/or reliably with the 
> > Cisco switch with either the 'old' or 'new' transceivers, with either tulip 
> > 0.91 or 0.92.  Our hope was to forgo auto-negotiation (which we normally 
> > prefer) because it seems to be borked with the hardware we have.  However, all 
> > attempts to force speed and duplex with either tulip driver version have failed.
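> > 
> > For the record, the 'options=' goo amounts to a module parameter in 
> > /etc/conf.modules, something like the following (if we are reading the 0.9x 
> > media table right, 14 is the MII 100baseTx-FD entry; check medianame[] in 
> > your copy of tulip.c before trusting that number):
> > 
> >     alias eth0 tulip
> >     options tulip options=14 full_duplex=1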
> > 
> > Brian, you may want to check your 'netstat -i' from time to time, and/or make 
> > a pass through your cluster with tulip-diag to see if all your NICs are truly 
> > auto-negotiating properly with the switch(es).  We have seen situations where 
> > the auto-negotiation originally succeeds, but a couple of days later we 
> > find the switch running 100/full and the card 100/half.  This causes bad 
> > things to happen with respect to network performance.
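> > 
> > A quick pass over the nodes can be as simple as the following (assuming rsh 
> > access and tulip-diag on each node's PATH; the node names are placeholders, 
> > and tulip-diag's default output includes the currently selected port and 
> > duplex):
> > 
> >     for n in node01 node02 node03; do
> >         echo "== $n"
> >         rsh $n 'netstat -i; tulip-diag'
> >     done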
> > 
> > --
> > Dave Thompson  <thomas@cs.wisc.edu>
> > 
> > Associate Researcher                    Department of Computer Science
> > University of Wisconsin-Madison         http://www.cs.wisc.edu/~thomas
> > 1210 West Dayton Street                 Phone:    (608)-262-1017
> > Madison, WI 53706-1685                  Fax:      (608)-262-6626
> > --
> > 
> > 
> > 
> > 
> > "Brian D. Haymore" wrote:
> > >We have a 170-node Beowulf cluster all using this same network card.  We
> > >have found that many versions of the tulip driver produce the exact
> > >results you have seen.  Currently we have found version .91 to be
> > >the most reliable for us.  Red Hat 6.2 comes with a slightly newer
> > >version and we had issues with that.  We also tried the .92 version in
> > >module form from Donald Becker's site and found that this driver somehow
> > >got a completely different MAC address for the card (weird, huh!).  We
> > >reported that bug and have not heard anything more beyond that.  So as
> > >it is, we are still using .91 without any issues we can find.  We also
> > >are getting up to ~92 Mb/s.  Hope this helps.
> > >
> > >Michael Immonen wrote:
> > >> 
> > >> Hi all,
> > >> 
> > >> We have been struggling with an issue that has taken
> > >> many weeks to nearly solve. I am looking for a bit
> > >> more advice to bring this to a close.
> > >> 
> > >> Background:
> > >> We have a 100-node cluster, all with Kingston
> > >> KNE100TX NICs. It is split in two: 64 nodes and 36
> > >> nodes. Each set is attached to its own Cisco Catalyst
> > >> 4000.
> > >> These systems were seeing several symptoms that we
> > >> eventually tied together:
> > >> 1. Ridiculously slow data transfer speeds: with the
> > >> nodes and switch configured for 100Mbps, data was
> > >> transferring at well below 10Mbps, with the actual
> > >> value varying.
> > >> 2. Severe packet loss due to carrier errors, as
> > >> reported in /proc/net/dev (a quick way to check this
> > >> is shown after the list). Again highly variable:
> > >> poorly performing nodes had carrier errors on
> > >> anywhere from 0.10% to 94.00% of transmitted packets.
> > >> 3. All affected nodes were found to be operating in
> > >> half duplex while the switch and the good nodes were
> > >> in full duplex. This was discovered using
> > >> tulip-diag.c.
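> > >> 
> > >> (For the carrier-error check in item 2, a quick look
> > >> on any one node is something like:
> > >> 
> > >>     ifconfig eth0 | grep -i carrier   # carrier count on the TX line
> > >>     grep eth0: /proc/net/dev          # raw TX packet/carrier counters
> > >> 
> > >> with the percentage being carrier errors divided by
> > >> transmitted packets.)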
> > >> 
> > >> We thought we had a final solution when Kingston
> > >> assisted us in tracking a known hardware issue
> > >> related to some versions of the SEEQ MII transceiver.
> > >> They informed us that, under Linux, and with
> > >> some switches, there were several versions of the SEEQ
> > >> chip that had intermittent "timing issues". The
> > >> SEEQ (or LSI) 80223/C were known good chips, but the
> > >> SEEQ 80220/G and 80223/B would sometimes display this
> > >> behavior. The tricky part is that in some cases, they
> > >> were perfectly fine. Kingston did an excellent job
> > >> assisting us with the replacement of all 100 NICs.
> > >> 
> > >> After all cards were swapped and the cluster was again
> > >> up and running, everything was beautiful- 100
> > >> FD all around. End of issue, or so we thought.
> > >> 
> > >> Nearly a week later, on checking the systems, 16 nodes
> > >> between the two sets were discovered to be in half
> > >> duplex again (but with MUCH lower carrier errors:
> > >> 0.01% to 0.09%). And just a couple of days later the
> > >> whole cluster was reported to be at half duplex.
> > >> 
> > >> All systems had been running kernel 2.2.12 with
> > >> tulip.c v0.91; one system was updated to 2.2.15 with
> > >> v0.92, but this did not solve the issue.
> > >> 
> > >> I have spent some time scanning the tulip lists and
> > >> have gained some information there, but now also
> > >> have some more questions...
> > >> 
> > >> Now, for my questions:
> > >> Why would a running system renegotiate its network
> > >> settings without user intervention?
> > >> 
> > >> I am assuming that the current problem has to do with
> > >> the Cisco switch issue that Donald Becker
> > >> mentioned on April 27, 2000 on this list. If this is
> > >> the case, would another type of Ethernet chip still
> > >> experience the same problems?
> > >> 
> > >> Donald Becker has stated that forcing speed and duplex
> > >> is not recommended. Could forcing these cards
> > >> to 100 FD for this issue be a safe solution?
> > >> I have read that using HD is better. Why?
> > >> Should the nodes be forced to 100 HD?
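> > >> 
> > >> If forcing is the way to go, I assume the switch
> > >> ports would need to be pinned as well; on the
> > >> Catalyst 4000 that would presumably be something like
> > >> (module/port numbers are placeholders):
> > >> 
> > >>     set port speed 2/1 100
> > >>     set port duplex 2/1 full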
> > >> 
> > >> Does anyone have any other recommendations or advice?
> > >> Maybe a recommended switch to replace the Cisco
> > >> switches?
> > >> 
> > >> Regards,
> > >> Michael
> > >> 
> > >> 
> > >
> > >--
> > >Brian D. Haymore
> > >University of Utah
> > >Center for High Performance Computing
> > >155 South 1452 East RM 405
> > >Salt Lake City, Ut 84112-0190
> > >
> > >Email: brian@chpc.utah.edu - Phone: (801) 585-1755 - Fax: (801) 585-5366
> > >
> > 
> > 
> 
> 
> _______________________________________________
> tulip mailing list
> tulip@scyld.com
> http://www.scyld.com/mailman/listinfo/tulip
>