[Beowulf] MPI_Isend/Irecv failure for IB and large message sizes

Martin Siegert siegert at sfu.ca
Mon Nov 16 13:24:50 PST 2009


Hi Mark,

On Sun, Nov 15, 2009 at 03:38:08PM -0500, Mark Hahn wrote:
>> I am running into problems when sending large messages (about
>> 180000000 doubles) over IB. A fairly trivial example program is attached.
>
> sorry if you've already thought of this, but might you have RLIMIT_MEMLOCK
> set too low?  (ulimit -l)

Good point.
By now I have played with all kinds of ulimits (the nodes have 16GB
of memory and 16GB of swap space - this program is not even coming close
to those limits). This is the current setting:
# ulimit -a
core file size          (blocks, -c) 0                            
data seg size           (kbytes, -d) unlimited                    
scheduling priority             (-e) 0                            
file size               (blocks, -f) unlimited                    
pending signals                 (-i) 139264
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) unlimited
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 139264
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

... same error :-(
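
The limits shown by an interactive shell are not necessarily the ones the MPI processes inherit when they are started through a remote launcher or daemon, so it can be worth checking the limit from inside the program itself. The following is only a minimal sketch under that assumption; the helper name memlock_is_unlimited is made up for illustration and is not part of any MPI library:

```c
#include <sys/resource.h>

/* Hypothetical helper: query the soft RLIMIT_MEMLOCK limit of the
 * calling process.
 * Returns 1 if the limit is unlimited, 0 if it is finite,
 * and -1 if getrlimit() itself fails. */
int memlock_is_unlimited(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return -1;

    return rl.rlim_cur == RLIM_INFINITY ? 1 : 0;
}
```

Calling this on every rank right after MPI_Init and printing the result would show whether the remote processes really see the same "unlimited" value as the shell above.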

>> [[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 199132400 opcode 549755813  vendor error 105 qp_idx 3
>
> 105 looks like it might be an errno to me:
> #define ENOBUFS         105     /* No buffer space available */
>
> regards, mark.

BTW: when using Intel-MPI (MPICH2) the program segfaults with
l = 268435456 = 2^31/8, which makes me suspect that they use MPI_BYTE to
transfer the data internally and multiply the variable count by 8
without checking whether the integer overflows ...
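
To illustrate the suspected overflow: MPI count arguments are of type int, so a byte count of 268435456 * 8 = 2^31 no longer fits, because INT_MAX is 2^31 - 1 on the usual platforms with 32-bit int. The check below is only a sketch of the kind of guard that appears to be missing; the helper name byte_count_fits_int is hypothetical, and this says nothing about how Intel-MPI is actually implemented:

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if count * elem_size can be
 * represented as a non-negative int (the type used for MPI count
 * arguments), and 0 if the product would exceed INT_MAX.
 * The division avoids ever computing the overflowing product. */
int byte_count_fits_int(size_t count, size_t elem_size)
{
    if (elem_size == 0)
        return 1;               /* zero bytes always fit */

    return count <= (size_t)INT_MAX / elem_size ? 1 : 0;
}
```

With 8-byte doubles and a 32-bit int, byte_count_fits_int(268435455, 8) returns 1, while byte_count_fits_int(268435456, 8) returns 0, which matches the message length at which the segfault shows up.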

- Martin
