[Beowulf] MPI_Isend/Irecv failure for IB and large message sizes

Don Holmgren djholm at fnal.gov
Mon Nov 16 14:24:27 PST 2009


Be careful - ulimits can differ between an interactive shell launched with
rsh/ssh, an interactive batch shell launched with "qsub -I" and the like, the
environment of your batch script, and the environment of the processes launched
via mpirun.  I've been burned by this before.

If you are using a TM-based launch, for example (openmpi or OSU mpiexec), the
ulimit environment on a PBS/Torque batch setup will be governed by the ulimits
of pbs_mom, which in turn is governed by your init process and/or by any of
the ulimit commands in init.d/pbs-client.

The only way to be sure of a particular ulimit is to do a getrlimit() call in
your mpi-launched binary and check the size.
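
A minimal sketch of such a check, assuming Linux (where RLIMIT_MEMLOCK is
available) and a hypothetical helper name report_memlock_limit(); it only
reports what the calling process itself sees, which is the point:

```c
#include <stdio.h>
#include <sys/resource.h>

/* Print one limit value, spelling out RLIM_INFINITY as "unlimited". */
static void print_limit(FILE *out, const char *label, rlim_t value)
{
    if (value == RLIM_INFINITY)
        fprintf(out, "%s: unlimited\n", label);
    else
        fprintf(out, "%s: %llu bytes\n", label, (unsigned long long)value);
}

/* Hypothetical helper: report the soft and hard locked-memory limits
 * of the calling process. Returns 0 on success, -1 if getrlimit() fails. */
int report_memlock_limit(FILE *out)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        perror("getrlimit(RLIMIT_MEMLOCK)");
        return -1;
    }
    print_limit(out, "memlock soft limit", rl.rlim_cur);
    print_limit(out, "memlock hard limit", rl.rlim_max);
    return 0;
}
```

Calling report_memlock_limit(stderr) early in main() of the binary started by
mpirun (for example right after MPI_Init) shows the limit each rank actually
runs with, rather than the one your login shell reports.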

Chances are this isn't your problem, though, because usually the error messages
make it pretty clear that a memory lock failure has occurred.

Don Holmgren
Fermilab




On Mon, 16 Nov 2009, Martin Siegert wrote:

> Hi Mark,
>
> On Sun, Nov 15, 2009 at 03:38:08PM -0500, Mark Hahn wrote:
>>> I am running into problems when sending large messages (about
>>> 180000000 doubles) over IB. A fairly trivial example program is attached.
>>
>> sorry if you've already thought of this, but might you have RLIMIT_MEMLOCK
>> set too low?  (ulimit -l)
>
> Good point.
> By now I have played with all kinds of ulimits (the nodes have 16GB
> of memory and 16GB of swap space - this program is not even coming close
> to those limits). This is the current setting:
> # ulimit -a
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 139264
> max locked memory       (kbytes, -l) unlimited
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 1024
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) unlimited
> real-time priority              (-r) 0
> stack size              (kbytes, -s) unlimited
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 139264
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
> ... same error :-(
>
>>> [[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 199132400 opcode 549755813  vendor error 105 qp_idx 3
>>
>> 105 looks like it might be an errno to me:
>> #define ENOBUFS         105     /* No buffer space available */
>>
>> regards, mark.
>
> BTW: when using Intel-MPI (MPICH2) the program segfaults with
> l = 268435456 = 2^31/8 which makes me suspect that they use MPI_BYTE to
> transfer the data internally and multiply the variable count by 8
> without checking whether the integer overflows ...
>
> - Martin
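
The overflow Martin suspects can be illustrated with a small check. This is
only a sketch of the arithmetic, assuming a 32-bit int; the helper name
byte_count_fits_int() is hypothetical, and nothing here is taken from the
Intel-MPI or MPICH2 sources:

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if count elements of elem_size bytes
 * can be expressed as a byte count in a signed int, 0 otherwise.
 * Dividing instead of multiplying avoids overflowing in the check itself. */
int byte_count_fits_int(size_t count, size_t elem_size)
{
    if (elem_size == 0)
        return 1;
    return count <= (size_t)INT_MAX / elem_size;
}
```

With 8-byte doubles and a 32-bit int, byte_count_fits_int(268435455, 8)
returns 1 while byte_count_fits_int(268435456, 8) returns 0, since
268435456 * 8 = 2^31 is one more than INT_MAX. The 180000000 doubles from the
original report (1.44e9 bytes) still fit, so this would not by itself explain
the IB failure.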


