MPICH Problem on a Channel Bonded Mini-Cluster

Gordon Gere gagere at uwo.ca
Tue Jun 12 10:42:24 PDT 2001


Hi,
        I am running a small test cluster of four 1 GHz Athlon computers
channel bonded using Intel EtherExpress Pro+ (82559) network cards.  I
will quickly sum up my problem and post all of the related information at
the bottom of this message.
        When channel bonded the computers perform fine, and netperf shows
a roughly 95% improvement in network bandwidth.  As a performance test I
run a plane-wave geometry optimization on 32 water molecules using CPMD.
This test uses MPI's all-to-all communication extensively (hence our hope
that channel bonding would improve performance).  When this test is run
using the MPICH implementation of MPI, the program runs for approximately
5-8 minutes and then stalls.  By "stall" I mean the program remains
active and uses 95% of the CPU time, but no longer outputs any
information and no longer generates network traffic.  Eventually the
program reports an error regarding communication (see below).  I have
tested the exact same input and program using the LAM implementation of
MPI and it runs fine, finishing slightly faster than the non-channel-
bonded time.  Using single-channel networking we have found that LAM is
much faster on 2-6 nodes but starts losing performance above 8 (see
performance numbers at the end), while MPICH, although slower than LAM
at 2-6 nodes, continues to scale well up to 14 nodes.  We would prefer
to use MPICH if possible.  I hope someone can shed some light on why our
test finishes under LAM but not under MPICH.

Error Message:
net_recv failed for fd = 7
p2_730:  p4_error: net_recv read, errno = : 110
rm_l_2_731:  p4_error: interrupt SIGINT: 2
p1_739: (rm_l_1_740:  p4_error: net_recv read:  probable EOF on socket: 1
bm_list_19021:  p4_error: net_recv read:  probable EOF on socket: 1

System info:
Running RedHat 6.2 but upgraded the kernel to 2.2.16-22.

Network info (ifconfig output):
bond0     Link encap:Ethernet  HWaddr 00:02:B3:1C:D5:32
          inet addr:192.168.0.1  Bcast:192.168.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:78 errors:0 dropped:0 overruns:0 frame:0
          TX packets:33 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0

eth0      Link encap:Ethernet  HWaddr 00:02:B3:1C:D5:32
          inet addr:192.168.0.1  Bcast:192.168.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:39 errors:0 dropped:0 overruns:0 frame:0
          TX packets:17 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          Interrupt:5 Base address:0xa000

eth1      Link encap:Ethernet  HWaddr 00:02:B3:1C:D5:32
          inet addr:192.168.0.1  Bcast:192.168.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:39 errors:0 dropped:0 overruns:0 frame:0
          TX packets:16 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          Interrupt:10 Base address:0xc000

I bonded the interfaces using the latest bonding driver (v1.12) from
ftp.scyld.com, with the following commands:

modprobe bonding
ifconfig bond0 192.168.0.1 netmask 255.255.255.0 up
ifenslave bond0 eth0
ifenslave bond0 eth1
(and made changes to /etc/sysconfig/network-scripts/ifcfg-*)
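
A minimal sketch of what such ifcfg files can look like under bonding (the
MASTER/SLAVE directives are an assumption about what the installed
initscripts understand):

/etc/sysconfig/network-scripts/ifcfg-bond0:
    DEVICE=bond0
    IPADDR=192.168.0.1
    NETMASK=255.255.255.0
    ONBOOT=yes
    BOOTPROTO=none

/etc/sysconfig/network-scripts/ifcfg-eth0 (and likewise ifcfg-eth1):
    DEVICE=eth0
    MASTER=bond0
    SLAVE=yes
    ONBOOT=yes
    BOOTPROTO=none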

MPICH/LAM installation: Compiled and installed MPICH and LAM using the
Portland Group Compilers.

Results:
NetPerf results on eepro100 NICs:
single NIC	94.02 Mbits/s
bonded NICs	187.93 Mbits/s
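
For reference, a typical way to generate such numbers is a plain netperf
TCP stream test against a node running netserver (the peer address here is
just an example):

netperf -H 192.168.0.2 -t TCP_STREAM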

These results are per step in seconds using CPMD with 4 nodes:
		LAM		MPICH
single NIC	90s		119s
bonded NICs	80s		error (88s)*

* Although MPICH with channel bonding was able to complete 1-2 steps, it
was not able to finish.  I leave the numbers in for reference.

Results for single-channel CPMD runs (seconds per step) on the same input
for LAM and MPICH, by number of nodes:
		LAM		MPICH
1 cpu		217s		217s
2 cpu		117s		175s
4 cpu		84s		117s
6 cpu 		75s		98s
8 cpu		114s		60s
12 cpu		120s		48s
14 cpu		n/a		43s

	Since the error only occurs when I use channel bonding, it seems
clear that something related to the network configuration is the problem.
I have thought of a couple of possible problems and solutions and would
appreciate any advice on them.
	First, I have noticed that some people have reported problems with
the bonding.o module from the 2.2.16 kernel and suggested using the
bonding.o from 2.2.17; however, I have not been able to find such a
module.  I have thought about upgrading (again) to the 2.2.19 kernel
available from ftp.redhat.com, but haven't heard anything about it yet.
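
If it comes to rebuilding the module myself, a rough sketch of what I
would try is below; the source path, the exact config option, and the
module directory layout are guesses on my part:

cd /usr/src/linux-2.2.19
make menuconfig                  # enable the bonding driver as a module
make dep && make modules
cp drivers/net/bonding.o /lib/modules/2.2.16-22/net/
/sbin/depmod -a                  # may still need insmod -f if versions differ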
        Since the error generated by the program is rather oblique, the
problem is hard to diagnose.  One related symptom is transmit and receive
overruns (reported by ifconfig) on the interfaces used to communicate
within the cluster.  This points me towards some sort of network problem;
however, I see the same receive and transmit overruns when running LAM,
which still finishes.  This leads me to think that either LAM is
implemented to be more tolerant of such conditions, or MPICH runs into
some problem handling the overruns.  Perhaps the bonding.o module is at
fault, or the network cards/drivers fail silently when bonded.
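
To watch whether the overrun counters actually climb while a job is
running, a simple loop over the interfaces shown above is enough (just a
sketch):

while true; do
    date
    for i in bond0 eth0 eth1; do /sbin/ifconfig $i | grep overruns; done
    sleep 10
done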

Thanks in advance.

Gordon Gere

University of Western Ontario
(519)661-2111 ext. 86353
London, Ontario





