[Beowulf] substantial RX packet drops during Pallas over e1000 (Rocks 4.1)

Jeff Johnson jeff.johnson at wsm.com
Tue May 16 23:44:33 PDT 2006


Greetings,

    I am running Rocks 4.1 on a 30-node system and seeing serious RX packet
loss, drops, and overruns while running heavy MPI I/O over e1000. I have
replaced cabling and switches, updated the e1000 driver, and tried multiple
kernels. None of these changes seems to affect the issue. I am pursuing
a hardware resolution with Intel and Supermicro, but I am posting here in
case someone has seen similar behavior.

    System details:
       30 nodes - Intel Pentium-D 840, 4GB RAM, 80GB SATA
             Supermicro PDSMI motherboard
             Intel 82573E and 82573L gigabit ethernet controllers
             (only one network connected)
             2.6.9-34.ELsmp and 2.6.16.11 kernels
             e1000-7.0.38-1 driver

    Run details:
       mpirun -nolocal -np 18 -machinefile /home/test/machines.20-29 \
           /home/test/IMB-MPI1 Alltoall -npmin 18 -msglen /home/test/Lengths
       (msglen values of 32, 256, 512 and 1024 have each been run on their
       own, and every run resulted in packet drops)

   Packet drop example (other nodes show similar numbers):
           RX packets:1843133 errors:0 dropped:1245 overruns:0 frame:0
           TX packets:1764828 errors:0 dropped:0 overruns:0 carrier:0
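
    For scale, the RX line above works out to a drop fraction of roughly
0.07% of received packets. A minimal Python sketch of that arithmetic,
assuming the ifconfig "name:value" counter layout shown above (the helper
name parse_counters is only for illustration):

```python
import re

def parse_counters(line):
    """Parse one 'RX packets:N errors:N ...' style line from ifconfig output."""
    direction = line.split()[0]  # "RX" or "TX"
    # Collect every name:number pair on the line into a dict of ints.
    fields = {name: int(value) for name, value in re.findall(r"(\w+):(\d+)", line)}
    return direction, fields

rx_line = "RX packets:1843133 errors:0 dropped:1245 overruns:0 frame:0"
direction, rx = parse_counters(rx_line)

# Fraction of received packets that the interface reported as dropped.
drop_fraction = rx["dropped"] / rx["packets"]
print(f"{direction}: {rx['dropped']} of {rx['packets']} dropped ({drop_fraction:.4%})")
```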

    I have tried increasing the e1000 RxDescriptors value to the maximum
of 4096, thinking that the Alltoall test may be exhausting receive
buffer resources, but the drops still occur.
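
    When comparing settings such as RxDescriptors, per-run deltas are
easier to read than cumulative totals. A rough Python sketch, assuming the
standard Linux /proc/net/dev column layout; the interface name eth0, the
helper names, and the sample numbers are all made up for illustration:

```python
def read_rx_counters(proc_net_dev_text, iface):
    """Return (rx_packets, rx_dropped) for iface from /proc/net/dev text."""
    for line in proc_net_dev_text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name.strip() == iface:
            fields = rest.split()
            # Receive columns: bytes packets errs drop fifo frame compressed multicast
            return int(fields[1]), int(fields[3])
    raise KeyError(iface)

def drop_delta(before, after):
    """Packets received and packets dropped between two samples."""
    return after[0] - before[0], after[1] - before[1]

def sample(packets, dropped):
    """Build illustrative /proc/net/dev text (hypothetical numbers)."""
    return (
        "Inter-|   Receive                    |  Transmit\n"
        " face |bytes packets errs drop fifo frame compressed multicast"
        "|bytes packets errs drop fifo colls carrier compressed\n"
        f"  eth0: 5000000 {packets} 0 {dropped} 0 0 0 0 4000000 39000 0 0 0 0 0 0\n"
    )

before = read_rx_counters(sample(40000, 10), "eth0")  # taken before the run
after = read_rx_counters(sample(90000, 60), "eth0")   # taken after the run
packets, dropped = drop_delta(before, after)
print(f"run received {packets} packets, dropped {dropped}")
```

On a real node the text would come from open("/proc/net/dev").read(),
sampled once before and once after the mpirun.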

    On Intel's advice I enabled ARP filtering
(/proc/sys/net/ipv4/conf/all/arp_filter), but it did not change the
behavior.

Any ideas?

--Jeff








