[Beowulf] MPI_Isend/Irecv failure for IB and large message sizes

Mon Nov 16 21:04:07 PST 2009

Hi Gus,

On Mon, Nov 16, 2009 at 10:40:51PM -0500, Gus Correa wrote:
> Hi Martin
>
> I tried your program with the four combinations of
> IB and TCP/IP, mcmodel small and medium.
> I lazily didn't recompile OpenMPI (1.3.2) with mcmodel=medium,
> just the program, hence this is not a very clean test.
>
> FYI, we have dual-socket quad-core AMD Opteron
> nodes with 16GB RAM each.
> OpenMPI 1.3.2, CentOS 5.2, gcc 4.1.2, OFED 1.4.

We have dual-socket quad-core Intel E5430, 16GB,
OpenMPI-1.3.3, SL 5.3, gcc 4.3.2 (and a bunch of other compilers,
but gcc-4.3.2 is used to compile OpenMPI), OFED-1.3.2 (tested
OFED-1.4.1 on two test nodes).

> When I ran on 2 nodes and 16 processes the program would always fail
> with segmentation fault / address not mapped on all four
> combinations above.
>
> However, when I ran on 2 nodes and 2 processes (-bynode flag in
> use to direct each process to a separate node) then it
> worked over all four combinations!
>
> Here is the IB+medium stderr (you printed to stderr):
> id=1: calling irecv ...
> id=0: calling isend ...
>
> and the corresponding stdout:
> ...
> id=0: isend/irecv completed 1.954140
> id=1: isend/irecv completed 4.192037

Thanks!!
Now I am surprised ... this always fails here.
What's the difference?

> This rules out a problem with memory model, I suppose.
> Small is good enough for your message size,
> as long as there is enough RAM for all processes,
> MPI overhead, etc.
>
> Also, as Don Holmgren already pointed out to you,
> make sure your limits are properly set on the nodes.
> For instance, we use Torque, and we put these settings
> on the nodes' /etc/init.d/pbs_mom:
>
> ulimit -n 32768
> ulimit -s unlimited
> ulimit -l unlimited
>
> Just like Don, we've been burned by this before, when using the
> vendor original setup.
> Of course these limits can be set in other ways.

I have been running this on the two test nodes without going through
Torque to avoid exactly this kind of problem.
Anyway, I just ran the same program through Torque and ran "ulimit -a"
in the PBS script (all looks fine), but the program still fails.

> As a practical matter:
>
> Would it be possible/desirable to reduce the message size,
> splitting the huge message into several smaller ones?
> I know the wisdom is that one big message is better
> than many small ones, but here we're talking about huge,
> not big, and sizable, not small.
>
> Even your tiny test program takes a detectable time to run
> (4+ seconds on IB, 14+ seconds on TCP/IP).
> It may be worth writing another version of it looping over
> smaller messages,
> and do some timing tests to compare with the huge
> message version.
> There may be a sweet spot for the message size vs. number of
> messages, I would guess.
> Big may not always be better.
>
> In the past a user here had a program sending very large messages
> (big 3D arrays).
> Not so big as to hit the 2GB threshold, but big enough to
> slow down the nodes and the cluster.
> Rewriting the program to loop over smaller messages
> (2D array slices) solved the problem.
> I remember other threads in the MPICH and OpenMPI
> mailing lists that reported difficulties with huge messages.
>
> My $0.02
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------

In principle, yes ... I already wrote wrapper functions
myMPI_Isend, myMPI_Irecv that do exactly that.
However, we are talking about one of those quantum chemistry
programs: many thousands of lines ... I'd really like to avoid
this.

- Martin

> Martin Siegert wrote:
>> Hi,
>>
>> On Mon, Nov 16, 2009 at 04:55:51PM -0500, Gus Correa wrote:
>>> Hi Martin
>>>
>>> We didn't know which compiler you used.
>>> So what Michael sent you ("mmodel=memory_model")
>>> is the Intel compiler flag syntax.
>>> (PGI uses the same syntax, IIRR.)
>>
>> Now that was really stupid, I am using gcc-4.3.2 and even looked up
>> the correct syntax for the memory model, but nevertheless pasted the
>> Intel syntax into my configure script ... sorry.
>>
>>> Gcc/gfortran use "-mcmodel=memory_model" for x86_64 architecture.
>>> I only used this with Intel ifort, hence I am not sure,
>>> but "medium" should work fine for large data/not-so-large program
>>> in gcc/gfortran.
>>> The "large" model doesn't seem to be implemented by gcc (4.1.2)
>>> anyway.
>>> (Maybe it is there in newer gcc versions.)
>>> The darn thing is that gcc says "medium" doesn't support building
>>> shared libraries,
>>> hence you may need to build OpenMPI static libraries instead,
>>> I would guess.
>>> (Again, check this if you have a newer gcc version.)
>>> Here's an excerpt of my gcc (4.1.2) man page:
>>>
>>>
>>>        -mcmodel=small
>>>            Generate code for the small code model: the program and its
>>>            symbols must be linked in the lower 2 GB of the address
>>>            space.  Pointers are 64 bits.  Programs can be statically
>>>            or dynamically linked.  This is the default code model.
>>>
>>>        -mcmodel=kernel
>>>            Generate code for the kernel code model.  The kernel runs in
>>>            the negative 2 GB of the address space.  This model has to be
>>>            used for Linux kernel code.
>>>
>>>        -mcmodel=medium
>>>            Generate code for the medium model: The program is linked in
>>>            the lower 2 GB of the address space but symbols can be
>>>            located anywhere in the address space.  Programs can be
>>>            statically or dynamically linked, but building of shared
>>>            libraries are not supported with the medium model.
>>>
>>>        -mcmodel=large
>>>            Generate code for the large model: This model makes no
>>>            assumptions about addresses and sizes of sections.
>>>            Currently GCC does not implement this model.
>>
>> I recompiled OpenMPI with -mcmodel=medium and -mcmodel=large. The program
>> still fails. The error message changes, however:
>>
>> id=1: calling irecv ...
>> id=0: calling isend ...
>> mlx4: local QP operation err (QPN 340052, WQE index 0, vendor syndrome 70, opcode = 5e)
>> [[55365,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error polling LP CQ with status LOCAL QP OPERATION ERROR status number 2 for wr_id 282498416 opcode 11046  vendor error 112 qp_idx 3
>>
>> (strerror(112) is "Host is down", which is certainly not correct).
>> This now points to system libraries - libmlx4. Am I correct in assuming that
>> this is either an OFED problem or OpenMPI exceeding some buffers in OFED
>> libraries without checking?
>>
>>> If you are using OpenMPI, "ompi_info --config"
>>> will show the flags used to compile it.
>>> Mine is 1.3.2 and has no explicit mcmodel flag,
>>> which according to the gcc man page should default to "small".
>>
>> Are you - in fact, is anybody - able to run my test program? I am
>> hoping that there is some stupid misconfiguration on the cluster
>> that can be fixed easily, without reinstalling/recompiling all
>> apps ...
>>
>>> Are you using 16GB per process or for the whole set of processes?
>>
>> I am running the two processes on different nodes (and nothing else
>> on the nodes), thus each process has the full 16GB available.
>>
>>> I hope this helps,
>>> Gus Correa
>>> ---------------------------------------------------------------------
>>> Gustavo Correa
>>> Lamont-Doherty Earth Observatory - Columbia University
>>> Palisades, NY, 10964-8000 - USA
>>> ---------------------------------------------------------------------
>>
>> Thanks!
>>
>> - Martin
>>
>>> Martin Siegert wrote:
>>>> Hi Michael,
>>>>
>>>> On Mon, Nov 16, 2009 at 10:49:23AM -0700, Michael H. Frese wrote:
>>>>> Martin,
>>>>>
>>>>> Could it be that your MPI library was compiled using a small memory 
>>>>> model?  The 180 million doubles sounds suspiciously close to a 2 GB 
>>>>> addressing limit.
>>>>>
>>>>> This issue came up on the list recently under the topic "Fortran Array 
>>>>> size question."
>>>>>
>>>>>
>>>>> Mike
>>>>
>>>> I am running MPI applications that use more than 16GB of memory - I do 
>>>> not believe that this is the problem. Also -mmodel=large
>>>> does not appear to be a valid argument for gcc under x86_64:
>>>> gcc -DNDEBUG -g -fPIC -mmodel=large   conftest.c  >&5
>>>> cc1: error: unrecognized command line option "-mmodel=large"
>>>>
>>>> - Martin
>>>>
>>>>> At 05:43 PM 11/14/2009, Martin Siegert wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I am running into problems when sending large messages (about
>>>>>> 180000000 doubles) over IB. A fairly trivial example program is attached.
>>>>>>
>>>>>> # mpicc -g sendrecv.c
>>>>>> # mpiexec -machinefile m2 -n 2 ./a.out
>>>>>> id=1: calling irecv ...
>>>>>> id=0: calling isend ...
>>>>>> [[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 199132400 opcode 549755813  vendor error 105 qp_idx 3
>>>>>>
>>>>>> This is with OpenMPI-1.3.3.
>>>>>> Does anybody know a solution to this problem?
>>>>>>
>>>>>> If I use MPI_Allreduce instead of MPI_Isend/Irecv, the program just hangs
>>>>>> and never returns.
>>>>>> I asked on the openmpi users list but got no response ...
>>>>>>
>>>>>> Cheers,
>>>>>> Martin
>>>>>>
>>>>>> --
>>>>>> Martin Siegert
>>>>>> Head, Research Computing
>>>>>> WestGrid Site Lead
>>>>>> IT Services                                phone: 778 782-4691
>>>>>> Simon Fraser University                    fax:   778 782-4242
>>>>>> Burnaby, British Columbia                  email: siegert at sfu.ca
>>>>>> Canada  V5A 1S6
>>

-- 
Martin Siegert
Head, Research Computing
WestGrid Site Lead
IT Services                                phone: 778 782-4691
Simon Fraser University                    fax:   778 782-4242
Burnaby, British Columbia                  email: siegert at sfu.ca
Canada  V5A 1S6