Mpich 1.2.3 first run problem

Felix Rauch rauch at inf.ethz.ch
Tue Sep 17 08:40:39 PDT 2002


On Mon, 16 Sep 2002, Jim Matthews wrote:
> I have been seeing a very strange problem with mpich 1.2.3 (and 1.2.4 as
> well from someone else who tested it).  The problem occurs immediately
> following a reboot of the cluster nodes.  What happens is that if I try
> to run a job on 16 processors, for example, the job will hang and never
> start when an mpirun is invoked.  The solution is to start with a 2
> processor job, which will always work, and from there go to a 3
> processor job working my way up to 16 processors.  Once that is done the
> job will run on all 16 processors (or however many) and continue to run
> and be re-run, with long periods of interruption, until the cluster is
> rebooted, at which time the problem will once again surface.

We had a similar problem once when we installed mpich to compare it
with SCore. The problem was that mpich didn't work correctly when
there were two mpich jobs on different machines with identical
process identifiers (PIDs), which is likely right after a reboot,
when all nodes have started the same set of processes. To solve the
problem we wrote a little script that logged in to all the nodes and
started a different number of small jobs on each node (e.g.
node_number * 100 "echo"s), so that the PID counters on the nodes
no longer ran in step.
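
I don't have the original script at hand any more, but the idea can be
sketched roughly as follows. This is only an illustration under
assumptions: the host names (node01, ...), the use of ssh as the remote
shell, and the helper names pid_burn_command and desynchronize_pids are
all made up for this example. It calls /bin/echo rather than the shell
builtin, because only an external program forks a new process and so
uses up a PID.

```python
import subprocess


def pid_burn_command(node_number, per_node=100):
    """Build a shell loop that starts node_number * per_node tiny processes."""
    count = node_number * per_node
    # /bin/echo is an external binary: every call forks and consumes one PID.
    return (
        f"i=0; while [ $i -lt {count} ]; do "
        "/bin/echo >/dev/null; i=$((i+1)); done"
    )


def desynchronize_pids(nodes, remote_shell="ssh", dry_run=True):
    """Run a different number of throwaway processes on every node.

    nodes is a list of host names (hypothetical here). With dry_run=True
    the commands are only built and returned, nothing is executed.
    """
    commands = []
    for node_number, host in enumerate(nodes, start=1):
        argv = [remote_shell, host, pid_burn_command(node_number)]
        commands.append(argv)
        if not dry_run:
            # Assumes password-less login to the node is already set up.
            subprocess.run(argv, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical node names; replace with the real cluster hosts.
    for argv in desynchronize_pids(["node01", "node02", "node03"]):
        print(" ".join(argv))
```

Run as shown it only prints the commands; passing dry_run=False would
actually log in to each node, once per boot and before the first mpirun.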

- Felix
-- 
Felix Rauch                      | Email: rauch at inf.ethz.ch
Institute for Computer Systems   | Homepage: http://www.cs.inf.ethz.ch/~rauch/
ETH Zentrum / RZ H18             | Phone: +41 1 632 7489
CH - 8092 Zuerich / Switzerland  | Fax:   +41 1 632 1307




More information about the Beowulf mailing list