MPICH hangs on dual-CPU computers when np uses all CPUs

Nina Nitroy nina at microway.com
Thu Aug 2 10:52:27 PDT 2001


I build many different kinds of Beowulf clusters and have noticed a
trend of increasing problems on clusters using MPICH and dual-CPU
computers.  The problem presents itself this way: MPI programs started
with an np parameter asking for the total number of CPUs (e.g. for 20
dual-CPU computers, using -np 40) will run for several hours and then
hang.  The errors are along the lines of:

p4_error Timeout establishing connection to remote process:0 interrupt
SIGINT: 2

or

Trying to receive a message when there are no connections.
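
For concreteness, these runs are started with the stock MPICH mpirun in
the usual ch_p4 way, roughly like this (the program name and the
machine-file argument here are just illustrative stand-ins):

    # illustrative launch commands for 20 dual-CPU computers
    mpirun -np 40 -machinefile machines ./cpi   # one process per CPU -- eventually hangs
    mpirun -np 20 -machinefile machines ./cpi   # one process per computer -- never a problem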

A simple restart of the job starts up fine and continues, because
network and rsh connectivity are still working; you can ping and rsh to
every node.

A more dramatic example is that one of the nodes will no longer be
rsh-able and has to be rebooted.  In that case, the error messages look
more like:

p4_error net_recv read: probable eof on socket:

I have done tests using different switches, different Ethernet cards
(Kingston, Intel Etherpro, etc.), different device drivers, and
different motherboards.  The permutations seem endless.  I have tested a
variety of 2.4+ kernels and both GNU and PGI compilers.  In all cases I
am using Red Hat 7.0 and rsh.  In all cases, the MPICH source is version
1.2.0.  In all cases, using an np value equal to half the number of CPUs
(e.g. for 20 dual-CPU computers, using -np 20), there are NO problems at
all.  In all cases, the computers pass rigorous stress testing, and
tests of the network itself show no problems.  I have used the Portland
Group compilers to build MPICH 1.2.0 from scratch as well as using their
canned 3.2-4 package.  MPICH was built on a Linux 6.2 system.  I haven't
tried a fresh MPICH build yet.

To reproduce this problem, I am using the standard MPI examples, cpi
and srtest, and executing them over and over again in a script
(sketched below).  Smaller numbers of computers (5-15) work fine.  More
than that will eventually fail, and generally, the more computers in
the test, the more quickly it fails.

The machines list is very simple: every computer in the test, listed
once.
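
In other words, the file is just hostnames, one per line, something
like this (names made up).  As I understand the ch_p4 device, when -np
is larger than the number of entries, mpirun wraps around the list,
which is how two processes end up on each node; I am not using the
host:2 notation that MPICH also accepts for SMP nodes.

    # machines -- one entry per computer, each listed once
    node01
    node02
    node03
    ...
    node20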

I will build MPICH 1.2.1 to see how it affects the situation, but I
fail to understand what is different now compared to before, when I had
dozens of clusters running without fail on all sorts of combinations.
This all seemed to start getting worse when we began using Red Hat 7.0,
but that could be a coincidence.

Any ideas?

Nina



