[Beowulf] IB problem/using IB diagnostics

Gus Correa gus at ldeo.columbia.edu
Fri Jun 19 09:14:52 PDT 2009


Prentice Bisbal wrote:
> John Hearns wrote:
>>
>> 2009/6/18 Prentice Bisbal <prentice at ias.edu>
>>
>>     John Hearns wrote:
>>     > Can you log into node36 and run ibstat or ibstatus?
>>     >
>>
>> Looks good to me!
>> Links are up and it sees a subnet manager. As Greg says, looks like
>> something wonky in the script which is reporting
>> the node status??
> 
> It's actually an MPI job (HPL using OpenMPI) which is reporting the
> problem.
> 
> The head scratching continues...
> 

Hi Prentice, list

Just in case you haven't seen this ...
Are you using OpenMPI 1.3.0 or 1.3.1?
Those versions have a memory leak bug when using IB.
The fix for the memory leak is to upgrade to 1.3.2.
A workaround is to pass "-mca mpi_leave_pinned 0" to mpirun
(on the command line the parameter name and value are separate arguments).
See:

http://www.open-mpi.org/community/lists/announce/2009/04/0030.php
https://svn.open-mpi.org/trac/ompi/ticket/1853
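
If you want to script the workaround, here is a minimal sketch in Python of how one might add it only for the affected versions. The function names, the 16-rank count, and the ./xhpl binary name are my own assumptions for illustration, not anything from OpenMPI itself. It also assumes a plain numeric version string such as "1.3.1" (no "rc" suffixes).

```python
import shlex

# Versions named above as affected by the memory leak over InfiniBand.
AFFECTED_VERSIONS = {(1, 3, 0), (1, 3, 1)}


def parse_version(text):
    """Turn '1.3.1' (or '1.3') into a 3-tuple of ints, padding with zeros."""
    parts = [int(p) for p in text.strip().split(".")[:3]]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def build_mpirun_command(version, nprocs, program):
    """Return an mpirun argument list, adding the workaround if needed."""
    cmd = ["mpirun"]
    if parse_version(version) in AFFECTED_VERSIONS:
        # On the mpirun command line an MCA parameter and its value are
        # separate arguments: -mca <name> <value>.
        cmd += ["-mca", "mpi_leave_pinned", "0"]
    cmd += ["-np", str(nprocs), program]
    return cmd


if __name__ == "__main__":
    # Hypothetical job: 16 ranks of an HPL binary called ./xhpl.
    print(shlex.join(build_mpirun_command("1.3.1", 16, "./xhpl")))
```

For 1.3.1 this prints "mpirun -mca mpi_leave_pinned 0 -np 16 ./xhpl"; for 1.3.2 the -mca arguments are left out.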

My HPL runs with OpenMPI 1.3.1 crashed when using lots of memory.
Upgrading to 1.3.2 fixed the problem,
but I never looked closely at the error messages,
so your problem may be different.
However, memory leaks can produce weird errors that are hard to diagnose.

My $0.02.

Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------