AGAIN: mpi-prog from lam -> scyld beompi DIES
Peter Beerli beerli at genetics.washington.edu
Sat Dec 8 12:36:25 PST 2001
Some time ago I asked about a problem with my MPI program on a Scyld
Beowulf cluster and got no real response.
- Has nobody ever ported a LAM-MPI program to a Scyld Beowulf cluster?
- Did I miss the right keywords, or is some information missing?
Any hints? I am adding my post again.
Peter
On Wed, 28 Nov 2001, Peter Beerli wrote:
> Hi,
> I have a program developed using MPI-1 under LAM.
> It runs fine on several LAM-MPI clusters with different architecture.
> A user wants to run it on a Scyld-beowulf cluster and there it fails.
> I did a few tests myself, and it seems
> that the program stalls if run on more than 3 nodes but works for
> 2-3 nodes. The program has a master-slave architecture in which the master
> mostly does nothing. Some reports are sent to stdout from any node
> (but this seems to work in beompi the same way as in LAM).
> Several things are unclear to me
> because I have no clue about the beompi system, or about Beowulf and Scyld in
> particular.
>
> (1) If I run "top", why do I see 6 processes running when I start
> with mpirun -np 3 migrate-n?
I received a useful response here, but it does not solve my problem.
This part is resolved: it is just the way MPICH treats the run and I/O.
But do these different processes have different MPI IDs? If so, that would
be a problem.
>
> (2) The data phase stalls on the slave nodes.
> The master node reads the data from a file and then broadcasts
> a large char buffer to the slaves. Is this wrong, or is there a better way
> to do it? [I do not know in advance how big the data is, and it is a complex
> mix of strings, numbers, etc.]
>
> void
> broadcast_data_master (data_fmt * data, option_fmt * options)
> {
> long bufsize;
> char *buffer;
> buffer = (char *) calloc (1, sizeof (char));
> bufsize = pack_databuffer (&buffer, data, options);
> MPI_Bcast (&bufsize, 1, MPI_LONG, MASTER, comm_world);
> MPI_Bcast (buffer, bufsize, MPI_CHAR, MASTER, comm_world);
> free (buffer);
> }
In case you wonder about the size of the buffer: it gets expanded
in pack_databuffer().
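To illustrate what "expanded" means here, this is a minimal sketch of how one packing step could grow the buffer through the char ** argument. append_to_buffer() is a hypothetical stand-in and is not the actual code of pack_databuffer(); it also assumes that the returned size counts the data bytes without the terminating NUL.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one step inside pack_databuffer():
 * append len bytes from src to *buffer, which already holds `used`
 * bytes, growing the allocation with realloc().
 * Returns the new number of used bytes, or -1 if realloc() fails. */
long
append_to_buffer (char **buffer, long used, const char *src, long len)
{
  char *grown;
  assert (buffer != NULL && src != NULL && used >= 0 && len >= 0);
  /* one extra byte keeps the buffer NUL-terminated */
  grown = (char *) realloc (*buffer, (size_t) (used + len + 1));
  if (grown == NULL)
    return -1;                  /* *buffer is unchanged and still valid */
  memcpy (grown + used, src, (size_t) len);
  grown[used + len] = '\0';
  *buffer = grown;              /* the caller sees the (possibly moved) block */
  return used + len;
}
```

Because realloc() may move the block, the caller has to pass &buffer rather than buffer, which is why the master starts from a 1-byte calloc() and hands over the address of the pointer.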
>
> void
> broadcast_data_worker (data_fmt * data, option_fmt * options)
> {
> long bufsize;
> char *buffer;
> MPI_Bcast (&bufsize, 1, MPI_LONG, MASTER, comm_world);
> buffer = (char *) calloc (bufsize, sizeof (char));
> MPI_Bcast (buffer, bufsize, MPI_CHAR, MASTER, comm_world);
> unpack_databuffer (buffer, data, options);
> free (buffer);
> }
>
> The master and the first node seem to read the data fine,
> but the others either do not and wait, or silently die.
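One detail that may be worth ruling out: the count argument of MPI_Bcast is an int, while bufsize above is a long. The sketch below is a small range check that could run on both master and worker before the second broadcast. bcast_count_from_long() is a hypothetical helper, and there is no evidence that this is the cause of the stall; it only removes one source of doubt.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: convert the long buffer size into the int
 * count that MPI_Bcast expects.
 * Returns 0 and stores the count on success, or -1 if the size is
 * not positive or does not fit into an int. */
int
bcast_count_from_long (long bufsize, int *count)
{
  assert (count != NULL);
  if (bufsize <= 0 || bufsize > INT_MAX)
    return -1;
  *count = (int) bufsize;
  return 0;
}
```

In the functions above, the checked count would replace bufsize in the second MPI_Bcast call, and a worker that gets -1 could print its rank and the received size before aborting instead of waiting silently.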
>
> (3) What is the easiest way to debug this? With LAM I just attached
> gdb to the pids on the different nodes, but here the nodes are transparent to me
> [but as I said, I have never used a Beowulf cluster before].
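A generic way to make the attach-with-gdb approach possible is to let each process announce its host and pid and then wait until a debugger releases it. The sketch below shows that pattern. wait_for_debugger() and the environment variable name are hypothetical, and whether the variable reaches the remote processes and whether gdb can attach to them on a Scyld system are assumptions that would need checking.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: if the environment variable env_name is set,
 * print host and pid, then spin until a debugger clears keep_waiting.
 * Returns 0 if no wait was requested, 1 after being released. */
int
wait_for_debugger (const char *env_name)
{
  volatile int keep_waiting = 1;
  char host[256];

  assert (env_name != NULL);
  if (getenv (env_name) == NULL)
    return 0;                   /* normal run: do not wait */

  if (gethostname (host, sizeof host) != 0)
    host[0] = '\0';
  host[sizeof host - 1] = '\0';
  fprintf (stderr, "pid %ld on %s is waiting for a debugger\n",
           (long) getpid (), host);
  fflush (stderr);
  while (keep_waiting)
    sleep (5);                  /* in gdb: set var keep_waiting = 0 */
  return 1;
}
```

Called right after MPI_Init, for example as wait_for_debugger ("MIGRATE_DEBUG_WAIT"), each process stops early enough to attach, move up to this frame, set keep_waiting to 0, and continue into the broadcast.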
>
>
> Can you give me any pointers or hints?
>
> thanks
> Peter
>
--
Peter Beerli, Genome Sciences, Box #357730, University of Washington,
Seattle WA 98195-7730 USA, Ph:2065438751, Fax:2065430754
http://evolution.genetics.washington.edu/PBhtmls/beerli.html
