[Beowulf] mpich2 complain about nodes that i dont use

Martin Siegert siegert at sfu.ca
Fri Sep 30 19:37:41 PDT 2005


On Fri, Sep 30, 2005 at 09:47:46PM -0400, Mark Hahn wrote:
> > I am using mpich2 on linux cluster, I kept having errors like the following
> > 
> > rank 14 in job 2  cn128_57798   caused collective abort of all ranks
> >   exit status of rank 14: killed by signal 9
> 
> signal 9 is sigkill (not segv or abrt, etc), and I'd be a bit surprised
> if this happened other than by someone killing the process.

I was indeed surprised when I saw signal 9 with one of our codes as
well. In that case the code turned out to need a larger stack size than
the current limits (ulimit, etc.) permitted. So if "ulimit -s" shows
something like 8192 (kB), you may want to increase the limit and try
again. I could imagine the same thing happening with code that has a
memory leak and runs the system out of memory.
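
The stack limit check can also be done from inside a program. Below is
a minimal sketch, assuming a Linux node and Python's standard "resource"
module; the helper names stack_limit_kb and raise_stack_limit are
hypothetical and only for illustration. It reports the same value as
"ulimit -s" and raises the soft limit as far as the hard limit allows.

```python
import resource


def stack_limit_kb():
    """Return the (soft, hard) stack limits in kB; None means unlimited."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)

    def to_kb(value):
        # RLIM_INFINITY is how the kernel reports "unlimited".
        return None if value == resource.RLIM_INFINITY else value // 1024

    return to_kb(soft), to_kb(hard)


def raise_stack_limit():
    """Raise the soft stack limit to the hard limit (no root required)."""
    _, hard = resource.getrlimit(resource.RLIMIT_STACK)
    resource.setrlimit(resource.RLIMIT_STACK, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_STACK)


if __name__ == "__main__":
    print("stack limit before (kB):", stack_limit_kb())
    raise_stack_limit()
    print("stack limit after  (kB):", stack_limit_kb())
```

The new limit applies only to the calling process and to children it
starts afterwards, so a wrapper like this would have to launch the
actual job itself. How the limit reaches processes on other nodes
depends on how they are started there.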

- Martin

-- 
Martin Siegert
Head, HPC at SFU
WestGrid Site Manager
Academic Computing Services                        phone: (604) 291-4691
Simon Fraser University                            fax:   (604) 291-4242
Burnaby, British Columbia                          email: siegert at sfu.ca
Canada  V5A 1S6


