[eepro100] eepro100 channel bonding and running MPICH problem

Gordon Gere gagere@uwo.ca
Wed, 13 Jun 2001 12:28:12 -0400


Hi,
	I am running a small test cluster of 1 GHz Athlon computers channel bonded
using Intel EtherExpress Pro+ (82559) network cards.  I installed RedHat 6.2
and then upgraded the kernel (using packages) to 2.2.16-22.  I installed the
v1.13 eepro100 driver from www.scyld.com and, apart from some warnings about
redefinitions, it compiled and installed perfectly.  I also got the e100
driver from www.intel.com to test it.
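
	For reference, the bonding itself uses the standard Linux bonding driver
with ifenslave; the snippet below is an illustrative sketch of that kind of
setup (module aliases and commands are typical, not copied from my actual
config files):

# /etc/conf.modules (illustrative)
alias bond0 bonding
alias eth0 eepro100
alias eth1 eepro100

# bring up the bond and enslave both NICs
ifconfig bond0 192.168.0.1 netmask 255.255.255.0 up
ifenslave bond0 eth0
ifenslave bond0 eth1
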
        When channel bonded, the network performs fine using the eepro100
driver from scyld.com, but with the Intel-provided driver the performance is
actually worse than with a single channel.  The netperf scores are shown
below:

                        eepro100 (v1.13)        e100 (Intel v1.6.6)
single NIC              94.02 Mbits/s           95.3 Mbits/s
double NIC              187.93 Mbits/s          89.7 Mbits/s
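
(These are bulk TCP throughput numbers between two nodes, i.e. a netperf run
along the lines of:

netperf -H <other node> -t TCP_STREAM

with <other node> standing in for the second machine's hostname.)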

	As a performance test I run a plane-wave geometry optimization on 32 water
molecules using CPMD.  This test uses MPI's all-to-all communication
extensively (hence our hope that channel bonding would improve performance).
When this test is run using the MPICH implementation of MPI, the program runs
for approx. 5-8 minutes and then stalls.  By "stall" I mean the process
remains active and uses 95% of the CPU time, but stops producing output and
generates no network traffic.  Eventually the program reports an error
regarding communication (see below).  I have tested the exact same input and
program using the LAM implementation of MPI and it runs fine, finishing
slightly faster than the non-channel-bonded time.
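
	For what it's worth, the jobs are launched in the usual way for each
implementation, roughly as below (executable and file names are just
placeholders):

# MPICH (ch_p4 device)
mpirun -np 4 -machinefile machines ./cpmd.x input.inp

# LAM
lamboot hostfile
mpirun -np 4 ./cpmd.x input.inp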

Error Message:
net_recv failed for fd = 7
p2_730:  p4_error: net_recv read, errno = : 110
rm_l_2_731:  p4_error: interrupt SIGINT: 2
p1_739: (rm_l_1_740:  p4_error: net_recv read:  probable EOF on socket: 1
bm_list_19021:  p4_error: net_recv read:  probable EOF on socket: 1
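
A note on the errno: 110 on Linux/x86 is ETIMEDOUT, i.e. the net_recv read
eventually timed out.  A quick way to check the number-to-name mapping on a
given box:

grep -w 110 /usr/include/asm/errno.h

which should show the ETIMEDOUT define.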

	There are no errors reported by the network driver, except for a few
transmit overruns, and those occur with LAM but not with MPICH.  In fact,
MPICH doesn't generate any overruns at all and still fails.

Network info (ifconfig output after a successful run using LAM):
bond0     Link encap:Ethernet  HWaddr 00:02:B3:1C:D5:32
          inet addr:192.168.0.1  Bcast:192.168.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:1816298 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1886586 errors:0 dropped:0 overruns:63 carrier:0
          collisions:0 txqueuelen:0

eth0      Link encap:Ethernet  HWaddr 00:02:B3:1C:D5:32
          inet addr:192.168.0.1  Bcast:192.168.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:908186 errors:0 dropped:0 overruns:0 frame:0
          TX packets:943293 errors:0 dropped:0 overruns:32 carrier:0
          collisions:0 txqueuelen:100
          Interrupt:5 Base address:0xa000

eth1      Link encap:Ethernet  HWaddr 00:02:B3:1C:D5:32
          inet addr:192.168.0.1  Bcast:192.168.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:908112 errors:0 dropped:0 overruns:0 frame:0
          TX packets:943293 errors:0 dropped:0 overruns:31 carrier:0
          collisions:0 txqueuelen:100
          Interrupt:10 Base address:0xc000

	Perhaps someone can help me diagnose this problem, or at least suggest
some tests that would give a better picture of what is happening.  It seems
like CPMD using MPICH runs into some network errors and can't recover from
them, whereas CPMD using LAM is a bit more robust.

Gordon Gere

University of Western Ontario
(519)661-2111 ext. 86353
London, Ontario