Mpich 1.2.3 first run problem

William Gropp gropp at mcs.anl.gov
Tue Sep 17 11:04:41 PDT 2002


At 05:40 PM 9/17/2002 +0200, Felix Rauch wrote:
>On Mon, 16 Sep 2002, Jim Matthews wrote:
> > I have been seeing a very strange problem with mpich 1.2.3 (and 1.2.4 as
> > well from someone else who tested it).  The problem occurs immediately
> > following a reboot of the cluster nodes.  What happens is that if I try
> > to run a job on 16 processors, for example, the job will hang and never
> > start when an mpirun is invoked.  The solution is to start with a 2
> > processor job, which will always work, and from there go to a 3
> > processor job working my way up to 16 processors.  Once that is done the
> > job will run on all 16 processors (or however many) and continue to run
> > and be re-run, with long periods of interruption, until the cluster is
> > rebooted, at which time the problem will once again surface.
>
>We had a similar problem once when we installed mpich to compare it
>with Score. The problem was that mpich didn't work correctly when
>there were two mpich-jobs on different machines but with identical
>process identifiers (PIDs). To solve the problem we wrote a little
>script that logged in to all the nodes and started a different
>number of small jobs on the nodes (e.g. node_number * 100 "echo"s).

Thanks!  That points us to the code that must be broken.  We'll have a fix 
in a few days.

Bill
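
The workaround Felix describes above can be sketched roughly as follows.
This is not his original script, which was not posted; it is a minimal
illustration that assumes each node hands out PIDs sequentially after a
reboot, that passwordless ssh to the nodes works, and that the remote login
shell is POSIX-compatible. The host names and the helper names (burn_count,
build_commands, offset_pids) are hypothetical. Each node runs a different
number of short-lived processes (node_number * 100), so that later jobs are
unlikely to get the same PID on two machines.

```python
import shlex
import subprocess


def burn_count(node_number, step=100):
    """Number of throwaway processes to start on a node (node_number * step)."""
    return node_number * step


def build_commands(hosts, step=100):
    """Return one ssh argv per host; each host gets a different process count."""
    commands = []
    for node_number, host in enumerate(hosts, start=1):
        count = burn_count(node_number, step)
        # "/bin/sh -c :" starts a real child process each time, so every
        # iteration uses up one PID on the remote node. A shell builtin
        # such as echo would not necessarily fork.
        remote = (
            f"i=0; while [ $i -lt {count} ]; do /bin/sh -c :; i=$((i+1)); done"
        )
        commands.append(["ssh", host, remote])
    return commands


def offset_pids(hosts, step=100, dry_run=True):
    """Print the ssh commands (dry_run=True) or run them on each host."""
    for argv in build_commands(hosts, step):
        if dry_run:
            print(shlex.join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical node names; replace with the cluster's own host list.
    offset_pids(["node01", "node02", "node03"])
```

Run as-is, the script only prints the commands it would execute. Calling
offset_pids(hosts, dry_run=False) after a reboot and before the first mpirun
would actually start the processes. This only works around the symptom; the
fix in mpich itself is the real solution.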



