Bogdan.Costescu at IWR.Uni-Heidelberg.De
Thu Oct 26 13:48:19 PDT 2000
On Wed, 25 Oct 2000, J. G. LaBounty wrote:
> > The head node works fine, but people have mentioned problems with the 3Com
> > network card. Testing has shown no problems, but the vendor has informed us
> > that they have had problems with the 3Com cards ("some batches don't seem to
> > work") and has offered Intel EtherExpress PRO 10/100+ TX PCI cards for
> > the same cost.
> We were using the 3Com 905B cards on two 16-node clusters. Our application
> keeps the network pegged most of the time. We were getting network
> hangs about once every two weeks running RH 6.1. We moved to RH 6.2
> and switched to the 3c90x driver, and the problem then happened about once
> per day. We have since swapped the 3Com cards for the EtherExpress PRO
> 10/100 and have not seen the problem, but we only have about 3 weeks of
> runtime on this configuration.
Sorry guys, but I don't quite get it!
The network is perhaps the most important part of a cluster setup. And what
do you do about it? "I heard that this card doesn't work right" or "It
seems that this card works better". There is nothing wrong with asking
about card/driver combinations on this list, but do you ALSO look at the
archives of the mailing lists devoted to the development of these drivers?
And if you have a problem, do you report it on such a list?
Or do you just say: "OK, this card/driver combination is just crap, let's
change it"? What if you still have problems after the change: will you
make another change?
I have encountered the same way of thinking on the NFS list...
For reference: http://www.scyld.com/network/index.html has links for
drivers (and more) while mailing list archives start at:
Going back to the 3Com problem: the driver present in kernels up
to around 2.2.15 was an old one, based on Don's 0.99H and modified by
various people. It had a race that could only be triggered in a
very narrow window, but 2-3 weeks of uptime under load is enough for
that window to be hit (I know because I had exactly the same
problem). Now I have 3C905 B and C cards in UP and SMP nodes with
uptimes of more than 2 months (we do upgrade kernels from time to time).
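To illustrate why a very narrow race window still shows up after a couple of weeks under load, here is a rough Python sketch. The packet rate and the per-packet chance of hitting the window are hypothetical numbers picked only for illustration; they are not measurements from any driver or cluster.

```python
import math


def prob_at_least_one_hit(p_per_event: float, events: float) -> float:
    """Chance that a race with per-event probability p fires at least once.

    This is 1 - (1 - p)^n, written with log1p/expm1 so that it stays
    accurate when p is tiny and n is huge.
    """
    return -math.expm1(events * math.log1p(-p_per_event))


# Hypothetical values, for illustration only.
PACKETS_PER_SECOND = 10_000   # assumed sustained load on one card
P_PER_PACKET = 1e-10          # assumed chance of landing in the race window

if __name__ == "__main__":
    for days in (1, 14, 60):
        n = PACKETS_PER_SECOND * 86_400 * days
        p = prob_at_least_one_hit(P_PER_PACKET, n)
        print(f"{days:3d} days of load: P(hang) ~ {p:.3f}")
```

With assumptions like these, a hang is unlikely on any given day but becomes more likely than not within a few weeks, which is the kind of pattern described above.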
RH 6.1 had the "bad" driver; the original kernel from RH 6.2 also had it,
but the updated 2.2.16-3 kernel has the new one. Note that this is the
3c59x driver, not the 3c90x driver (which is written by 3Com).
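Since the two driver names are easy to mix up, a quick check of which one a node has loaded may help. The following is a minimal Python sketch, assuming the driver is built as a module (a driver compiled into the kernel will not appear in /proc/modules). The function name and the descriptions in the table are my own, for illustration.

```python
from pathlib import Path

# Module names as used in this post; the descriptions paraphrase it.
KNOWN_DRIVERS = {
    "3c59x": "kernel driver derived from Don's code",
    "3c90x": "driver written by 3Com",
}


def loaded_3com_drivers(proc_modules_text):
    """Return the known 3Com driver modules listed in /proc/modules content."""
    found = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first field of each /proc/modules line is the module name.
        if fields and fields[0] in KNOWN_DRIVERS:
            found.append(fields[0])
    return found


if __name__ == "__main__":
    path = Path("/proc/modules")
    text = path.read_text() if path.exists() else ""
    for name in loaded_3com_drivers(text):
        print(f"{name}: {KNOWN_DRIVERS[name]}")
```

This only tells you which module is loaded, not which version of it; for that you still have to look at the kernel version or the driver's boot messages.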
If you trust Don's drivers more, his 3c59x is available from:
http://www.scyld.com/network/vortex.html and it includes (AFAIK) a
fix for this problem and much more.
IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu at IWR.Uni-Heidelberg.De