Problems with MPICH 1.2 and Beowulf/Linux

Francesco Marini marini at pcmenelao.mi.infn.it
Tue Jun 6 06:55:21 PDT 2000


Hi all,

    I've got a really weird problem with MPICH 1.2.
    The system consists of a server and 16 computing nodes, all
diskless, mounting root via NFS from the server.  It works very well
with pvm and LAM-MPI.
    Now I'm trying to build the latest MPICH source (1.2.0). The make
process goes well, but when I run "make testing" I get this output
(repeated for every test that uses more than one machine):

*** Testing MPI_Test ***
pcwalhalla : Mon May 29 16:27:09 CEST 2000
/work/staff/marini/mpich-1.2.0/bin/mpicc -DUSE_SOCKLEN_T
-DUSE_U_INT_FOR_XDR -DFORTRANUNDERSCORE -DHAVE_MPICHCONF_H
-DHAVE_STDLIB_H=1 -DUSE_STDARG=1 -DHAVE_LONG_DOUBLE=1
-DHAVE_LONG_LONG_INT=1 -DHAVE_PROTOTYPES=1 -DHAVE_SIGNAL_H=1
-DHAVE_SIGACTION=1   -c persistent.c
/work/staff/marini/mpich-1.2.0/bin/mpicc  -o persistent persistent.o
*** Testing MPI_Recv_init ***
Differences in persistent.out
2,5c2,8
< rm_3383:  p4_error: rm_start: net_conn_to_listener failed: 3165
< p0_20161:  p4_error: Timeout in making connection to remote process on node1: 0
< bm_list_20162:  p4_error: interrupt SIGINT: 2
< rm_l_1_20168:  p4_error: interrupt SIGINT: 2
---
> Receiving message 1
> Received message 1
> Receiving message 2
> Received message 2
> Receiving message 3
> Received message 3
> Completed all receives
7d9
< rm_20167:  p4_error: interrupt SIGINT: 2
pcwalhalla : Mon May 29 16:32:12 CEST 2000
/work/staff/marini/mpich-1.2.0/bin/mpicc -DUSE_SOCKLEN_T
-DUSE_U_INT_FOR_XDR -DFORTRANUNDERSCORE -DHAVE_MPICHCONF_H
-DHAVE_STDLIB_H=1 -DUSE_STDARG=1 -DHAVE_LONG_DOUBLE=1
-DHAVE_LONG_LONG_INT=1 -DHAVE_PROTOTYPES=1 -DHAVE_SIGNAL_H=1
-DHAVE_SIGACTION=1   -c persist.c
/work/staff/marini/mpich-1.2.0/bin/mpicc  -o persist persist.o
*** Testing MPI_Startall/Request_free ***
Differences in persist.out
2,5c2
< rm_3388:  p4_error: rm_start: net_conn_to_listener failed: 3171
< p0_20318:  p4_error: Timeout in making connection to remote process on node1: 0
< bm_list_20319:  p4_error: interrupt SIGINT: 2
< rm_l_1_20325:  p4_error: interrupt SIGINT: 2
---
> No errors
7d3
< rm_20324:  p4_error: interrupt SIGINT: 2
pcwalhalla : Mon May 29 16:37:14 CEST 2000
/work/staff/marini/mpich-1.2.0/bin/mpicc -DUSE_SOCKLEN_T
-DUSE_U_INT_FOR_XDR -DFORTRANUNDERSCORE -DHAVE_MPICHCONF_H
-DHAVE_STDLIB_H=1 -DUSE_STDARG=1 -DHAVE_LONG_DOUBLE=1
-DHAVE_LONG_LONG_INT=1 -DHAVE_PROTOTYPES=1 -DHAVE_SIGNAL_H=1
-DHAVE_SIGACTION=1   -c persist2.c
/work/staff/marini/mpich-1.2.0/bin/mpicc  -o persist2 persist2.o
*** Testing MPI_Startall(Bsend)/Request_free ***
Differences in persist2.out
2,5c2
< rm_3391:  p4_error: rm_start: net_conn_to_listener failed: 3177
< p0_20473:  p4_error: Timeout in making connection to remote process on node1: 0
< bm_list_20474:  p4_error: interrupt SIGINT: 2
< rm_l_1_20480:  p4_error: interrupt SIGINT: 2
---

    It seems that MPICH either cannot start the remote process or cannot
establish the connection to it. The strange thing is that with PVM and
LAM-MPI everything works fine.

    Any ideas?
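
    One way to narrow this down would be to check plain TCP reachability
between a node and the server, independently of MPICH. The following is a
minimal Python sketch, not part of MPICH or p4: the function name
can_connect is hypothetical, and it only tests whether a TCP connection
can be opened at all, not whether p4 itself would succeed.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) opens within timeout.

    Hypothetical helper for illustration only; it is not an MPICH or p4 API.
    Name-resolution failures, refused connections and timeouts all count
    as "cannot connect" and yield False.
    """
    try:
        # create_connection resolves the host name and performs the connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # socket.timeout and socket.gaierror are subclasses of OSError.
        return False
```

    Run on a computing node, a call such as can_connect("server", 12345)
(host name and port are placeholders for a port on which something on the
server is actually listening) would show whether the node can reach the
server over TCP at all.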

    Second: I'm having trouble building ScaLAPACK with LAM-MPI, gcc
and pgf77 (the f77 compiler from the Portland Group). The build gives a
lot of unresolved symbols related to MPI. Has anyone succeeded in
building it with the same configuration?

    Thank you all in advance,


Franz Marini


---------------------------------------------
Franz Marini
Sys Admin and Software Analyst,
Dept. of Physics, University of Milan, Italy.
email : marini at pcmenelao.mi.infn.it
---------------------------------------------





