[Beowulf] MPI_Isend/Irecv failure for IB and large message sizes
Don Holmgren djholm at fnal.gov
Mon Nov 16 14:24:27 PST 2009
Be careful - ulimits can differ between an interactive shell launched
with rsh/ssh, an interactive batch shell launched with "qsub -I" and
the like, the environment of your batch script, and the environment of
the processes launched via mpirun. I've been burned by this before.

If you are using a TM-based launch, for example (openmpi or OSU
mpiexec), the ulimit environment on a PBS/Torque batch setup will be
governed by the ulimits of pbs_mom, which in turn is governed by your
init process and/or by any of the ulimit commands in
init.d/pbs-client.

The only way to be sure of a particular ulimit is to do a getrlimit()
call in your mpi-launched binary and check the size.

Chances are this isn't your problem, though, because usually the error
messages make it pretty clear that a memory lock failure has occurred.

Don Holmgren
Fermilab

On Mon, 16 Nov 2009, Martin Siegert wrote:

> Hi Mark,
>
> On Sun, Nov 15, 2009 at 03:38:08PM -0500, Mark Hahn wrote:
>>> I am running into problems when sending large messages (about
>>> 180000000 doubles) over IB. A fairly trivial example program is attached.
>>
>> sorry if you've already thought of this, but might you have RLIMIT_MEMLOCK
>> set too low? (ulimit -l)
>
> Good point.
> By now I have played with all kinds of ulimits (the nodes have 16GB
> of memory and 16GB of swap space - this program is not even coming close
> to those limits). This is the current setting:
> # ulimit -a
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 139264
> max locked memory       (kbytes, -l) unlimited
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 1024
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) unlimited
> real-time priority              (-r) 0
> stack size              (kbytes, -s) unlimited
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 139264
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
> ... same error :-(
>
>>> [[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 199132400 opcode 549755813 vendor error 105 qp_idx 3
>>
>> 105 looks like it might be an errno to me:
>> #define ENOBUFS 105 /* No buffer space available */
>>
>> regards, mark.
>
> BTW: when using Intel-MPI (MPICH2) the program segfaults with
> l = 268435456 = 2^31/8 which makes me suspect that they use MPI_Byte to
> transfer the data internally and multiply the variable count by 8
> without checking whether the integer overflows ...
>
> - Martin
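For anyone who wants to try the check suggested at the top of this message, here is a rough sketch in C of reading the locked-memory limit from inside the launched process with getrlimit(2). The function names format_limit and report_memlock_limit and the tag argument are made up for illustration; they are not from the original post or from any MPI library. Note that getrlimit() reports RLIMIT_MEMLOCK in bytes, while "ulimit -l" prints kbytes.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: render one limit value as text.
 * RLIM_INFINITY is what the shell's ulimit shows as "unlimited". */
void format_limit(rlim_t value, char *buf, size_t len)
{
    if (value == RLIM_INFINITY)
        snprintf(buf, len, "unlimited");
    else
        snprintf(buf, len, "%llu", (unsigned long long)value);
}

/* Hypothetical helper: print the soft and hard RLIMIT_MEMLOCK values
 * (in bytes) as seen by the calling process. "tag" is any label you
 * choose, e.g. a hostname or MPI rank. Returns 0 on success, -1 if
 * getrlimit() fails. */
int report_memlock_limit(FILE *out, const char *tag)
{
    struct rlimit rl;
    char soft[32], hard[32];

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        perror("getrlimit(RLIMIT_MEMLOCK)");
        return -1;
    }
    format_limit(rl.rlim_cur, soft, sizeof soft);
    format_limit(rl.rlim_max, hard, sizeof hard);
    fprintf(out, "%s: RLIMIT_MEMLOCK soft=%s hard=%s (bytes)\n",
            tag, soft, hard);
    return 0;
}
```

In an MPI program one would presumably call report_memlock_limit() right after MPI_Init(), passing the rank or hostname as the tag, so that the output reflects the environment mpirun actually gave each process rather than that of the login shell.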
