[eepro100] EEpro100, Red Hat 7.1, wait_for_cmd_done timeout errors

David B. Ritch dritch@hpti.com
Sun Feb 16 05:21:52 2003


Bigger release numbers are apparently not necessarily better. ;-)

Unfortunately, we need to stay with 7.x in this case.  What kernels did
you use?  The stock Red Hat kernels?  Some from www.kernel.org?

Apparently, with 7.1 and new drivers the problem was there, but with 6.2
and old drivers, it wasn't.  So - I wonder about 6.2 with the new
drivers, and about what differences other than kernel and drivers
might affect this.  At one point, we had some ribbon cable PCI extenders
that caused all sorts of problems, and appeared to contribute to this.
When we went to riser cards, the system was *much* more stable in many
ways, and this problem soon disappeared - until we moved the system.

Well, we're turning this cluster over to our client, so I won't have a
system that we know is exhibiting the problem to experiment with.
However, the problem will bug me, and I suspect it will bug our client,
too, if it arises again, so I'd like to have a better handle on it.

I guess I'll keep watching and listening.  Maybe someone will solve it. 
Or maybe it just won't come up again on my equipment...

Thanks!

dbr

On Fri, 2003-02-14 at 14:59, Jim Hribnak wrote:
> I had this problem with numerous NICs running under RH 7.1 on a Dell
> 2450 server.
> 
> I ended up moving the 30 websites from that machine to a Dell PowerEdge 4400
> server (same NICs as the other server), but I decided to install Red Hat 6.2
> (same as I have on a 4500 server running 500 web sites + same NICs).
> 
> Since doing that, that server has not had a hiccup (KNOCK ON WOOD!!!!!!!!) in
> close to 3 weeks!  The 2450 is still online but has no traffic to it.  It as
> well has been online for 3 weeks.  I am not sure where to begin in saying
> what the problem could be, as the differences between a 2450 and a 4400
> or 4500 server could be substantial.  Could it be hardware? Possibly.
> Could it have been Red Hat 7.1? Possibly.
> 
> 
> I even tried new Scyld drivers, and I even ran the Intel drivers.
> 
> On the RH 6.2 servers, both are running 1999 drivers by Becker and one other
> guy.  All seems to be running fine.
> 
> Jim
> 
> 
> 
> ----- Original Message -----
> From: "David B. Ritch" <dritch@hpti.com>
> To: <eepro100@scyld.com>
> Sent: Thursday, February 13, 2003 10:34 AM
> Subject: [eepro100] EEpro100, Red Hat 7.1, wait_for_cmd_done timeout errors
> 
> 
> > There were a couple threads in January on this subject.  Was there ever
> > a resolution to this issue?
> >
> > I'm seeing some similar problems with the onboard NIC on a Tyan 2720
> > motherboard in a small cluster, and they're really strange.  A couple of
> > weeks ago, the problems stopped for no apparent reason.  Then we shipped
> > the system to a client site, and they came back.
> >
> > Pretty regularly, some of the nodes lose their ethernet after being up
> > for 90 minutes.  We are using the driver from Scyld, and sleep mode is
> > turned off.
> >
> > Since shipping the system seems to have triggered it again, I'm a little
> > suspicious of cables and connectors...
> >
> > Thanks,
> >
> > dbr
> > --
> > David B. Ritch
> > High Performance Technologies, Inc.
> > _______________________________________________
> > eepro100 mailing list
> > eepro100@scyld.com
> > http://www.scyld.com/mailman/listinfo/eepro100
-- 
David B. Ritch
High Performance Technologies, Inc.