mpi-prog porting from lam -> scyld beowulf mpi difficulties

Peter Beerli beerli at genetics.washington.edu
Wed Nov 28 17:03:46 PST 2001


Hi,
I have a program developed using MPI-1 under LAM.
It runs fine on several LAM-MPI clusters with different architectures.
A user wants to run it on a Scyld Beowulf cluster, and there it fails.
I did a few tests myself, and it seems
that the program stalls when run on more than 3 nodes but works on
2-3 nodes. The program has a master-slave architecture in which the master
mostly does nothing. Any node may send reports to stdout
(but this seems to work in beompi the same way as in LAM).
Several things are unclear to me
because I have no clue about the beompi system, or about Beowulf and Scyld
in particular.

(1) If I run "top", why do I see 6 processes running when I start
    with mpirun -np 3 migrate-n?

(2) The data phase stalls on the slave nodes.
    The master node reads the data from a file and then broadcasts
    a large char buffer to the slaves. Is this wrong? Is there a better way
    to do it? [I do not know how big the data is, and it is a complex mix
    of strings, numbers, etc.]

void
broadcast_data_master (data_fmt * data, option_fmt * options)
{
  long bufsize;
  char *buffer;
  /* start with a 1-byte buffer; pack_databuffer gets its address
     and returns the number of bytes packed */
  buffer = (char *) calloc (1, sizeof (char));
  bufsize = pack_databuffer (&buffer, data, options);
  /* first broadcast the size, then the packed data itself */
  MPI_Bcast (&bufsize, 1, MPI_LONG, MASTER, comm_world);
  MPI_Bcast (buffer, bufsize, MPI_CHAR, MASTER, comm_world);
  free (buffer);
}

void
broadcast_data_worker (data_fmt * data, option_fmt * options)
{
  long bufsize;
  char *buffer;
  /* receive the size first so the receive buffer can be allocated */
  MPI_Bcast (&bufsize, 1, MPI_LONG, MASTER, comm_world);
  buffer = (char *) calloc (bufsize, sizeof (char));
  /* receive the packed data and unpack it into data and options */
  MPI_Bcast (buffer, bufsize, MPI_CHAR, MASTER, comm_world);
  unpack_databuffer (buffer, data, options);
  free (buffer);
}

    The master and the first node seem to read the data fine,
    but the others either never get it and wait, or silently die.
   
(3) What is the easiest way to debug this? With LAM I just attached gdb to
    the pids on the different nodes, but here the nodes are transparent to me
    [but as I said, I have never used a Beowulf cluster before].


Can you give me any pointers or hints?

thanks
Peter
-- 
Peter Beerli,  Genome Sciences, Box #357730, University of Washington,
Seattle WA 98195-7730 USA, Ph:2065438751, Fax:2065430754
http://evolution.genetics.washington.edu/PBhtmls/beerli.html





