[Beowulf] mpich2 complains about nodes that I don't use


Martin Siegert siegert at sfu.ca
Fri Sep 30 19:37:41 PDT 2005


On Fri, Sep 30, 2005 at 09:47:46PM -0400, Mark Hahn wrote:
> > I am using mpich2 on linux cluster, I kept having errors like the following
> > 
> > rank 14 in job 2  cn128_57798   caused collective abort of all ranks
> >   exit status of rank 14: killed by signal 9
> 
> signal 9 is sigkill (not segv or abrt, etc), and I'd be a bit surprised
> if this happened other than by someone killing the process.

I was indeed surprised when I saw that (signal 9) with one of our codes
as well. In that case the code turned out to need a larger stack size
than the current settings (ulimit, etc.) permitted. Thus, if "ulimit -s"
shows something like 8192, you may want to increase that and try again.

I could imagine that something like this could also happen with code
that has a memory leak and runs the system out of memory.
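
As a rough illustration of the stack limit check described above, here is
a minimal Python sketch. It is not taken from the code discussed in this
thread, and the helper names (format_limit, choose_soft_limit,
raise_stack_limit) are made up for the example. It reads the limit that
"ulimit -s" reports and raises the soft limit toward a requested value
without going above the hard limit. It assumes a Unix system where
Python's standard resource module is available.

```python
import resource

UNLIMITED = resource.RLIM_INFINITY


def format_limit(value):
    """Render a limit in kB, the unit that `ulimit -s` prints."""
    if value == UNLIMITED:
        return "unlimited"
    return f"{value // 1024} kB"


def choose_soft_limit(soft, hard, wanted):
    """Pick a new soft limit (bytes): never below the current soft
    limit and never above the hard limit."""
    if soft == UNLIMITED:
        return soft                      # nothing to raise
    if hard == UNLIMITED:
        return wanted if wanted == UNLIMITED else max(soft, wanted)
    if wanted == UNLIMITED:
        return hard                      # best an unprivileged user can get
    return min(max(soft, wanted), hard)


def raise_stack_limit(wanted):
    """Raise the soft stack limit of this process toward `wanted` bytes."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    new_soft = choose_soft_limit(soft, hard, wanted)
    if new_soft != soft:
        resource.setrlimit(resource.RLIMIT_STACK, (new_soft, hard))
    return new_soft, hard


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    print("stack before: soft", format_limit(soft), "hard", format_limit(hard))
    # 64 MB is an arbitrary example value, not a recommendation.
    soft, hard = raise_stack_limit(64 * 1024 * 1024)
    print("stack after:  soft", format_limit(soft), "hard", format_limit(hard))
```

Child processes started afterwards inherit the raised soft limit. Whether
it reaches MPI ranks on other nodes depends on how those ranks are
launched, which this sketch does not address.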

- Martin

-- 
Martin Siegert
Head, HPC at SFU
WestGrid Site Manager
Academic Computing Services                        phone: (604) 291-4691
Simon Fraser University                            fax:   (604) 291-4242
Burnaby, British Columbia                          email: siegert at sfu.ca
Canada  V5A 1S6


