[eepro100] Dell 4400 instability with eepro100 driver...

Ben Greear greearb@candelatech.com
Sun Feb 17 01:19:01 2002


Henrik Schmiediche wrote:

>     Hello Ben,
> I have exchanged the RAM to completely different RAM and decreased it to
> 1GB. No success. I have also tried the 2.4.17 kernel with and without IRQ
> Rate patch with no success. Granted I have not tried the 2.4.17 kernel with
> latest eepro100 drivers --- I tried this when I was still using the stock
> eepro100 and the Intel e100 drivers. The number of driver/kernel/RAM
> permutations is getting ridiculous.


Well, I have done extensive testing on many different EEPRO NICs, and
other than a lockup bug when connected to a 10bt port, both eepro100 and e100
have been remarkably stable.  For my tests, I often run 50Mbps tx + 50Mbps rx
for 24 hour periods...  I'm generally running on PIII 1GHz class machines,
single CPU, with 128 - 256MB of RAM.

The latest eepro100 is supposed to fix the 10bt lockup bug, but I hope
you aren't running your beast against a 10bt hub anyway :)


> 
> I have not run memtest. Where do I find this?


Search Google... I don't know offhand.  The reason I mentioned it
is because of the error in your logs.
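
To see how often each kind of error shows up, the syslog lines can be
tallied with a short script.  This is only a rough sketch: the category
names and the tally() helper are made up for illustration, the patterns
are guessed from the messages quoted further down, and the handling of
"last message repeated N times" assumes syslog is repeating the line
just before it.

```python
import re
from collections import Counter

# Hypothetical categories, matched against the kernel messages quoted below.
PATTERNS = {
    "nmi": re.compile(r"NMI received"),
    "tx_timeout": re.compile(r"Transmit timed out"),
    "cmd_stall": re.compile(r"Command [0-9a-f]{4} was not (immediately )?accepted"),
}
REPEAT = re.compile(r"last message repeated (\d+) times")


def tally(lines):
    """Count log lines per category, crediting syslog repeat markers."""
    counts = Counter()
    last = None  # category of the most recent non-repeat line, if any
    for line in lines:
        m = REPEAT.search(line)
        if m:
            # Assumption: the marker refers to the line just before it.
            if last is not None:
                counts[last] += int(m.group(1))
            continue
        last = None
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                last = name
                break
    return counts


# Lines taken from the logs quoted in this thread (rejoined where wrapped).
SAMPLE = [
    "Feb 15 23:35:22 s0 kernel: Command 0080 was not immediately accepted, 10001 ticks!",
    "Feb 15 23:35:54 s0 last message repeated 19 times",
    "Feb 15 23:36:00 s0 last message repeated 3 times",
    "Feb 15 23:36:04 s0 kernel: eth0: Transmit timed out: status 0090  0080 at "
    "25279986/25280017 commands 000ca000 000c0000 000c0000.",
    "Feb 15 23:36:04 s0 kernel: Command 0080 was not immediately accepted, 10001 ticks!",
    "Feb 15 23:36:04 s0 kernel: eth0: Restarting the chip...",
    "Feb 15 23:36:04 s0 kernel: Command 0070 was not accepted after 10001 polls!",
    "Feb 16 07:58:38 s0 kernel: Uhhuh. NMI received. Dazed and confused, but trying to continue",
]

if __name__ == "__main__":
    for name, count in tally(SAMPLE).items():
        print(name, count)
```

To run it against a real log, pass an open file instead of SAMPLE, e.g.
tally(open("/var/log/messages")).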

One final suggestion:  If you are using PCI riser cards, consider
testing w/out the riser card (this can mean taking off face plates
or other screwdriver hacking, of course).  I have found many lockups
with cheap riser cards, but usually with 4-port NICs, not single
port eepro NICs...

Good luck!
Ben


> 
> Sincerely,
> 
>      - Henrik
> 
> 
> ----- Original Message -----
> From: "Ben Greear" <greearb@candelatech.com>
> To: "Henrik Schmiediche" <henrik@stat.tamu.edu>
> Cc: <eepro100@scyld.com>
> Sent: Saturday, February 16, 2002 10:39 PM
> Subject: Re: [eepro100] Dell 4400 instability with eepro100 driver...
> 
> 
> 
>>Have you tried running memtest to see if your memory is
>>good?  Might want to try limiting the machine to 1 or 2 GB
>>of RAM to see if the problems go away.  (Not a cure, but it
>>will tell us something useful.)
>>
>>Henrik Schmiediche wrote:
>>
>>
>>>      Hello,
>>>I have a single processor Dell 4400 server with 4GB of RAM that I cannot get
>>>to run stable under high network loads (NFS, remote backups). I am about
>>>ready to trash this system and go back to a Sun. I am running RH 7.2 with
>>>2.4.9-13 and I have used the stock eepro100 drivers that come with RH, the
>>>latest Intel 1.6.29 drivers and the latest eepro100 drivers and all of them
>>>lock up. I also get lockups (WATCHDOG/timeout) when I install a 3com 3c905C
>>>card (though I have not tried the latest drivers for this card from the
>>>scyld website). I have also tried changing to an external eepro100 card
>>>(instead of using the built-in one) with no success. When I installed the
>>>latest eepro100 drivers I get this NMI message which may be related to the
>>>lockups, but I am not sure... I have tried changing RAM with no success.
>>>
>>>Feb 16 07:58:38 s0 kernel: eepro100.c:v1.20 1/28/2002 Donald Becker <becker@scyld.com>
>>>Feb 16 07:58:38 s0 kernel:   http://www.scyld.com/network/eepro100.html
>>>Feb 16 07:58:38 s0 kernel: Uhhuh. NMI received. Dazed and confused, but trying to continue
>>>Feb 16 07:58:38 s0 kernel: You probably have a hardware problem with your RAM chips
>>>Feb 16 07:58:38 s0 kernel: Uhhuh. NMI received. Dazed and confused, but trying to continue
>>>Feb 16 07:58:38 s0 kernel: You probably have a hardware problem with your RAM chips
>>>Feb 16 07:58:38 s0 kernel: Uhhuh. NMI received for unknown reason 25.
>>>Feb 16 07:58:38 s0 kernel: Dazed and confused, but trying to continue
>>>Feb 16 07:58:38 s0 kernel: Do you have a strange power saving mode enabled?
>>>Feb 16 07:58:38 s0 kernel: eth0: Intel i82559 rev 8 at 0xf899f000, 00:B0:D0:20:87:60, IRQ 14.
>>>Feb 16 07:58:38 s0 kernel:   Board assembly 07195d-000, Physical connectors present: RJ45
>>>Feb 16 07:58:38 s0 kernel:   Primary interface chip i82555 PHY #1.
>>>Feb 16 07:58:38 s0 kernel:   General self-test: passed.
>>>Feb 16 07:58:38 s0 kernel:   Serial sub-system self-test: passed.
>>>Feb 16 07:58:38 s0 kernel:   Internal registers self-test: passed.
>>>Feb 16 07:58:38 s0 kernel:   ROM checksum self-test: passed (0x04f4518b).
>>>Feb 16 07:58:38 s0 kernel:   Receiver lock-up workaround activated.
>>>
>>>The error message I get (a whole lot of them):
>>>
>>>Feb 15 23:35:22 s0 kernel: Command 0080 was not immediately accepted, 10001 ticks!
>>>Feb 15 23:35:54 s0 last message repeated 19 times
>>>Feb 15 23:36:00 s0 last message repeated 3 times
>>>Feb 15 23:36:04 s0 kernel: eth0: Transmit timed out: status 0090  0080 at 25279986/25280017 commands 000ca000 000c0000 000c0000.
>>>Feb 15 23:36:04 s0 kernel: Command 0080 was not immediately accepted, 10001 ticks!
>>>Feb 15 23:36:04 s0 kernel: eth0: Restarting the chip...
>>>Feb 15 23:36:04 s0 kernel: Command 0070 was not accepted after 10001 polls!
>>>Feb 15 23:36:08 s0 kernel: eth0: Transmit timed out: status 0000  0010 at 25279986/25280018 commands 000ca000 000c0000 000c0000.
>>>Feb 15 23:36:08 s0 kernel: eth0: Restarting the chip...
>>>
>>>A few additional comments:
>>>
>>>   - I cannot recover from this except with a reboot. At least I do not
>>>know how.
>>>   - The eepro100  card shares an interrupt with the SCSI controller. Is
>>>there a way to reassign the IRQ of the eepro100 card?
>>>   - The system is even more unstable when I install a second CPU.
>>>
>>> Any ideas on what to try? Bad motherboard?
>>>
>>>Sincerely,
>>>
>>>      -  Henrik
>>>
>>>          CPU0
>>>  0:    5115449          XT-PIC  timer
>>>  1:       1875          XT-PIC  keyboard
>>>  2:          0          XT-PIC  cascade
>>>  5:         30          XT-PIC  aic7xxx
>>>  8:          1          XT-PIC  rtc
>>> 10:   36396202          XT-PIC  aic7xxx
>>> 11:          0          XT-PIC  usb-ohci
>>> 12:       3151          XT-PIC  PS/2 Mouse
>>> 14:    6638596          XT-PIC  aic7xxx, eth0
>>>NMI:          3
>>>ERR:          0
>>>
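
The shared interrupt mentioned above (IRQ 14 carrying both aic7xxx and
eth0) can be picked out of /proc/interrupts mechanically.  A minimal
sketch, assuming the 2.4-era layout shown here: an IRQ number, one
counter per CPU, a single-word controller name such as XT-PIC, then a
comma-separated device list.  The shared_irqs() name is made up for
illustration, and newer kernels print extra columns that this does not
handle.

```python
def shared_irqs(text):
    """Return {irq: [devices]} for IRQ lines listing more than one device."""
    shared = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep or not label.strip().isdigit():
            continue  # skips the CPU header and the NMI/ERR summary rows
        fields = rest.split()
        i = 0
        while i < len(fields) and fields[i].isdigit():
            i += 1  # step over the per-CPU interrupt counters
        # fields[i] is assumed to be the controller name (e.g. XT-PIC);
        # everything after it is the comma-separated device list.
        names = " ".join(fields[i + 1:])
        devices = [d.strip() for d in names.split(",") if d.strip()]
        if len(devices) > 1:
            shared[int(label)] = devices
    return shared


# The listing from this thread, with the quote markers removed.
SAMPLE = """\
          CPU0
  0:    5115449          XT-PIC  timer
  1:       1875          XT-PIC  keyboard
  2:          0          XT-PIC  cascade
  5:         30          XT-PIC  aic7xxx
  8:          1          XT-PIC  rtc
 10:   36396202          XT-PIC  aic7xxx
 11:          0          XT-PIC  usb-ohci
 12:       3151          XT-PIC  PS/2 Mouse
 14:    6638596          XT-PIC  aic7xxx, eth0
NMI:          3
ERR:          0
"""

if __name__ == "__main__":
    for irq, devices in shared_irqs(SAMPLE).items():
        print("IRQ %d shared by: %s" % (irq, ", ".join(devices)))
```

On a live machine the same function can be fed
open("/proc/interrupts").read().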
>>>PCI devices found:
>>>  Bus  0, device   0, function  0:
>>>    Host bridge: ServerWorks CNB20LE Host Bridge (rev 5).
>>>      Master Capable.  Latency=48.
>>>  Bus  0, device   0, function  1:
>>>    Host bridge: ServerWorks CNB20LE Host Bridge (#2) (rev 5).
>>>      Master Capable.  Latency=48.
>>>  Bus  0, device  17, function  0:
>>>    Host bridge: ServerWorks CNB20LE Host Bridge (#3) (rev 5).
>>>      Master Capable.  Latency=48.
>>>  Bus  0, device  17, function  1:
>>>    Host bridge: ServerWorks CNB20LE Host Bridge (#4) (rev 5).
>>>      Master Capable.  Latency=48.
>>>  Bus  0, device   4, function  0:
>>>    Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 8).
>>>      IRQ 14.
>>>      Master Capable.  Latency=32.  Min Gnt=8.Max Lat=56.
>>>      Non-prefetchable 32 bit memory at 0xfeb02000 [0xfeb02fff].
>>>      I/O at 0xfcc0 [0xfcff].
>>>      Non-prefetchable 32 bit memory at 0xfe900000 [0xfe9fffff].
>>>  Bus  0, device   6, function  0:
>>>    VGA compatible controller: ATI Technologies Inc 3D Rage IIC (rev 122).
>>>      Master Capable.  Latency=32.  Min Gnt=8.
>>>      Prefetchable 32 bit memory at 0xfd000000 [0xfdffffff].
>>>      I/O at 0xf800 [0xf8ff].
>>>      Non-prefetchable 32 bit memory at 0xfeb01000 [0xfeb01fff].
>>>  Bus  0, device  15, function  0:
>>>    ISA bridge: ServerWorks OSB4 South Bridge (rev 79).
>>>  Bus  0, device  15, function  2:
>>>    USB Controller: ServerWorks OSB4/CSB5 OHCI USB Controller (rev 4).
>>>      IRQ 11.
>>>      Master Capable.  Latency=32.  Max Lat=80.
>>>      Non-prefetchable 32 bit memory at 0xfeb00000 [0xfeb00fff].
>>>  Bus  6, device   4, function  0:
>>>    PCI bridge: PCI device 8086:0962 (Intel Corporation) (rev 1).
>>>      Master Capable.  Latency=32.  Min Gnt=6.
>>>  Bus  7, device   4, function  0:
>>>    SCSI storage controller: Adaptec 7899P (rev 1).
>>>      IRQ 10.
>>>      Master Capable.  Latency=32.  Min Gnt=40.Max Lat=25.
>>>      I/O at 0xcc00 [0xccff].
>>>      Non-prefetchable 64 bit memory at 0xfacff000 [0xfacfffff].
>>>  Bus  7, device   4, function  1:
>>>    SCSI storage controller: Adaptec 7899P (#2) (rev 1).
>>>      IRQ 5.
>>>      Master Capable.  Latency=32.  Min Gnt=40.Max Lat=25.
>>>      I/O at 0xc800 [0xc8ff].
>>>      Non-prefetchable 64 bit memory at 0xfacfe000 [0xfacfefff].
>>>  Bus  7, device   6, function  0:
>>>    SCSI storage controller: Adaptec AIC-7880U (rev 2).
>>>      IRQ 14.
>>>      Master Capable.  Latency=32.  Min Gnt=8.Max Lat=8.
>>>      I/O at 0xc400 [0xc4ff].
>>>      Non-prefetchable 32 bit memory at 0xfacfd000 [0xfacfdfff].
>>>
>>>[root@s0:/var/log]# mii-diag
>>>Using the default interface 'eth0'.
>>>Basic registers of MII PHY #1:  3000 782d 02a8 0154 05e1 41e1 0003 0000.
>>> The autonegotiated capability is 01e0.
>>>The autonegotiated media type is 100baseTx-FD.
>>> Basic mode control register 0x3000: Auto-negotiation enabled.
>>> You have link beat, and everything is working OK.
>>> Your link partner advertised 41e1: 100baseTx-FD 100baseTx 10baseT-FD 10baseT.
>>>   End of basic transceiver information.
>>>
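
The hex words in the mii-diag output above can be checked by hand.  The
sketch below decodes an advertisement word using the standard MII
auto-negotiation ability bits (registers 4 and 5), and ANDs the local
and partner words to get the common capability.  The function names are
made up for illustration, and the ordering of the decoded list is my own
choice, not necessarily what mii-diag prints in every case.

```python
# Technology ability bits shared by MII register 4 (our advertisement)
# and register 5 (link partner ability).
ABILITY_BITS = [
    (0x0200, "100baseT4"),
    (0x0100, "100baseTx-FD"),
    (0x0080, "100baseTx"),
    (0x0040, "10baseT-FD"),
    (0x0020, "10baseT"),
]
ABILITY_MASK = 0x03E0  # the five media bits listed above


def decode_advert(word):
    """List the media types set in an advertisement word."""
    return [name for bit, name in ABILITY_BITS if word & bit]


def common_capability(local, partner):
    """Media bits that both ends advertise."""
    return local & partner & ABILITY_MASK


if __name__ == "__main__":
    # Registers 4 and 5 from the "Basic registers" line in this thread.
    local, partner = 0x05E1, 0x41E1
    print("partner advertises:", " ".join(decode_advert(partner)))
    print("common capability: %04x" % common_capability(local, partner))
```

With the values from this thread the common capability works out to
01e0, which agrees with the "autonegotiated capability" line, so the
link negotiation itself looks sane.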
>>>
>>>
>>>
>>>_______________________________________________
>>>eepro100 mailing list
>>>eepro100@scyld.com
>>>http://www.scyld.com/mailman/listinfo/eepro100
>>>
>>>
>>>
>>
>>--
>>Ben Greear <greearb@candelatech.com>       <Ben_Greear AT excite.com>
>>President of Candela Technologies Inc      http://www.candelatech.com
>>ScryMUD:  http://scry.wanfear.com     http://scry.wanfear.com/~greear
>>


-- 
Ben Greear <greearb@candelatech.com>       <Ben_Greear AT excite.com>
President of Candela Technologies Inc      http://www.candelatech.com
ScryMUD:  http://scry.wanfear.com     http://scry.wanfear.com/~greear