[eepro100] EEpro100, Red Hat 7.1, wait_for_cmd_done timeout errors

David B. Ritch dritch@hpti.com
Sun Feb 16 05:21:47 2003


Thank you for your suggestion.  However, I'm already using the current
version of the eepro100 driver from the Scyld web site, 1.26, and the
2.4.20 kernel.

When we shipped a 48-node cluster to a client site, 10 of the nodes
experienced this problem.  I've read of an issue with the cards not
being initialized properly after a cold boot, so we rebooted them.  This
did not help.  We reseated the PCI riser cards in some of them and
rebooted all of them.  Magically, all but one came up and stayed up.  We
replaced the ethernet cable on that one, and it came up and stayed.

Moving from the e100 driver and the eepro100 driver to the current
driver from Scyld has resulted in improved stability, but we still see
this sort of problem.  It's a bit disturbing....

And then the system stabilizes for no apparent reason - so I don't know
what may destabilize it in the future.
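
For anyone trying to see which nodes are affected and how often, a minimal sketch of a per-node tally of the timeout message might look like the following. It assumes traditional syslog-style lines with the host name in the fourth field; the node names and sample lines are hypothetical and only illustrate the format, and only the "wait_for_cmd_done timeout" text comes from the reported error.

```python
import re
from collections import Counter

# Substring taken from the reported error; everything else here is illustrative.
PATTERN = re.compile(r"wait_for_cmd_done timeout")


def count_timeouts(lines):
    """Count timeout messages per host in syslog-style lines."""
    counts = Counter()
    for line in lines:
        if PATTERN.search(line):
            fields = line.split()
            # Assumed layout: "Mon DD HH:MM:SS host tag: message"
            host = fields[3] if len(fields) > 3 else "unknown"
            counts[host] += 1
    return counts


# Hypothetical sample lines, not real logs from the cluster.
sample = [
    "Feb 15 10:02:11 node07 kernel: eth0: wait_for_cmd_done timeout!",
    "Feb 15 10:02:12 node07 kernel: eth0: wait_for_cmd_done timeout!",
    "Feb 15 10:05:40 node12 kernel: eth0: wait_for_cmd_done timeout!",
    "Feb 15 10:06:01 node03 kernel: eth0: link up",
]

for host, n in sorted(count_timeouts(sample).items()):
    print(host, n)
```

Feeding it the collected logs from all nodes, instead of the sample list, would show whether the failures cluster on the same machines after each reboot or move around.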

dbr

On Fri, 2003-02-14 at 03:09, Alexander Tarkhov wrote:
> David,
> 
> An overview would be like this:
> There is a work-around that works in some cases -- upgraded Scyld driver
> or driver from chip vendor.
> There is pretty much no explanation, and even less of a resolution, for
> this problem.
> It seems the problem originates so deep in the chip design and
> architecture that it's just not worth explaining on a list like this -
> something about hw being ready/not ready at unpredictable times.
> (God, what happened to my English?)
> 
> But what is good about your case - you are the first to report
> multiple failures ("...some of the nodes...").
> Because what we have seen before were just distinct unlucky specimen...
> (Can't find the plural for this word in the Dictionary, pls. tell me
> someone?)
> So you are probably experiencing a real incompatibility between driver
> version and kernel version, which means a driver upgrade might help...
> 
> Good luck!
> Alexander
> 
> David B. Ritch wrote:
> 
>  >There were a couple threads in January on this subject.  Was there ever
>  >a resolution to this issue?
>  >
>  >I'm seeing some similar problems with the onboard NIC on a Tyan 2720
>  >motherboard in a small cluster, and they're really strange.  A couple of
>  >weeks ago, the problems stopped for no apparent reason.  Then we shipped
>  >the system to a client site, and they came back.
>  >
>  >Pretty regularly, some of the nodes lose their ethernet after being up
>  >for 90 minutes.  We are using the driver from Scyld, and sleep mode is
>  >turned off.
>  >
>  >Since shipping the system seems to have triggered it again, I'm a little
>  >suspicious of cables and connectors...
>  >
>  >Thanks,
>  >
>  >dbr
>  >
-- 
David B. Ritch
High Performance Technologies, Inc.