[Beowulf] substantial RX packet drops during Pallas over e1000 (Rocks 4.1)
jeff.johnson at wsm.com
Sat May 27 17:00:43 PDT 2006
I have an update on this problem. Many list members emailed me with
suggestions (sysctl, rx buffers, etc). None of those had an effect on
the problem. Intel engineers had me edit the e1000 driver source to
repartition the FIFO on the 82573V controller, but that did not help either.
I swapped in an Intel 82546EB PCI-X/133 card on a few nodes and
retested. The dropped packets disappeared on the nodes with the PCI-X
NIC. I then added some PCIe adapters that were different from those on the
motherboard, still Intel. I reran Pallas and the nodes with the PCIe NIC
still showed dropped packets.
It appears that this issue affects PCIe only. PCI-X NICs are
stable. I am running 220.127.116.11. PCIe support is enabled in the kernel.
The onboard PCIe NICs are PCIe-x1. The PCIe card I added is PCIe-x4.
Does anyone have any thoughts as to why these dropped packets would only
appear under PCIe?
I am running Rocks 4.1 on a 30-node system and seeing serious RX packet
loss, drops and overruns while running heavy MPI I/O over e1000. I have
replaced cabling and switches, updated the e1000 drivers, tried multiple
kernels, etc. No modification seems to affect the issue. I am pursuing
a hardware resolution with Intel and Supermicro, but I am posting here in
case someone has seen similar events.
30 nodes - Intel Pentium-D 840, 4GB RAM, 80GB SATA
Supermicro PDSMI motherboard
Intel 82573E and 82573L gigabit ethernet controllers
(only one network connected)
2.6.9-34.ELsmp and 18.104.22.168
mpirun -nolocal -np 18 -machinefile /home/test/machines.20-29
/home/test/IMB-MPI1 Alltoall -npmin 18 -msglen /home/test/Lengths
(msglen values of 32, 256, 512 and 1024 have been run exclusively, each
resulting in packet drops)
Packet drop example: (other nodes post similar numbers)
RX packets:1843133 errors:0 dropped:1245 overruns:0 frame:0
TX packets:1764828 errors:0 dropped:0 overruns:0 carrier:0
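To put those counters in proportion, here is a minimal Python sketch that parses the two ifconfig lines above and reports the drop rate per direction. The function names (parse_counters, drop_rate) are made up for illustration and the sample text is just the two lines quoted above; it assumes the older "name:value" ifconfig output format shown here.

```python
import re


def parse_counters(line):
    """Parse one ifconfig 'RX packets:...' or 'TX packets:...' line.

    Returns the direction ("RX" or "TX") and a dict of integer counters.
    Assumes the old-style 'name:value' ifconfig format.
    """
    direction = line.split()[0]
    fields = {name: int(value) for name, value in re.findall(r"(\w+):(\d+)", line)}
    return direction, fields


def drop_rate(fields):
    """Fraction of packets dropped; 0.0 if no packets were counted."""
    packets = fields["packets"]
    return fields["dropped"] / packets if packets else 0.0


# The two counter lines from the example above.
sample = """\
RX packets:1843133 errors:0 dropped:1245 overruns:0 frame:0
TX packets:1764828 errors:0 dropped:0 overruns:0 carrier:0
"""

if __name__ == "__main__":
    for line in sample.splitlines():
        direction, fields = parse_counters(line)
        print(f"{direction}: {fields['dropped']} dropped of "
              f"{fields['packets']} ({drop_rate(fields):.4%})")
```

For the RX line this works out to roughly 0.07% of received packets dropped. Running the same parse on each node's counters before and after a Pallas run would make it easier to compare the PCIe and PCI-X nodes.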
I have tried increasing the e1000 RxDescriptors value to the maximum
of 4096, thinking that the Alltoall test may be overtaxing receive
buffer resources, but the drops still occur.
At Intel's advice I set arp filtering but it did nothing to change
the behavior of the problem. (/proc/sys/net/ipv4/conf/all/arp_filter)
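For anyone wanting to confirm the setting on each node, a small Python sketch that reads the flag back from the /proc path given above. The helper name read_flag is hypothetical, and it only reports the current value; it does not change anything.

```python
from pathlib import Path


def read_flag(path):
    """Return the integer stored in a /proc/sys file, or None if it
    cannot be read or does not contain a number."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    # Path as quoted in the message above.
    value = read_flag("/proc/sys/net/ipv4/conf/all/arp_filter")
    print("arp_filter (all):", value)
```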