[Beowulf] mpich2 complain about nodes that i dont use


Ru-Zhen Li r.li at qmul.ac.uk
Sat Oct 1 03:43:32 PDT 2005


Dear Martin and Mark,

Thanks for the replies. The interesting thing, however, is that I used 
nearly exactly the same input files for my application, and the first job, 
which I submitted several days ago, ran fine with no errors.

Also, when I ran ulimit -s, it reported 10240. The cluster is not very 
stable, though: for example, when I ran mpdboot on every node before, it 
did not report any errors. So I am thinking the problem might be caused by 
communication between the nodes, but I am not sure.
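
For anyone who wants to check the stack limit from inside a process rather 
than from the login shell, here is a minimal sketch in Python. It is only an 
illustration, not something from this thread: the function names 
stack_limit_kib and describe are made up for the example, and it assumes a 
Unix system where Python's standard resource module is available. The value 
is converted to KiB so it can be compared with what ulimit -s prints. 
Processes started through the MPI launcher may not inherit the same limits 
as an interactive shell, so it could be worth running something like this 
on the nodes the same way the application itself is launched.

```python
import resource


def stack_limit_kib():
    """Return the (soft, hard) stack size limits in KiB.

    A value of None means the limit is unlimited.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)

    def to_kib(value):
        # getrlimit reports bytes; ulimit -s in bash reports 1024-byte units.
        return None if value == resource.RLIM_INFINITY else value // 1024

    return to_kib(soft), to_kib(hard)


def describe(limit_kib):
    """Format a limit the way a person would read it."""
    return "unlimited" if limit_kib is None else f"{limit_kib} KiB"


if __name__ == "__main__":
    soft, hard = stack_limit_kib()
    print(f"soft stack limit: {describe(soft)}")
    print(f"hard stack limit: {describe(hard)}")
```

If the soft limit printed on a compute node is lower than the one seen in 
the login shell, that would point toward the stack size explanation rather 
than a communication problem.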

Thanks again.




Yesterday is history, tomorrow is mystery, only today is a gift, that's why 
we call it present !

========================================================================
Ru-Zhen Li

0044 020 7882 6327
Materials Department
Queen Mary
University of London
E1 4NS

Email: r.li at qmul.ac.uk
Homepage: http://www.freewebs.com/lrz/
----- Original Message ----- 
From: "Martin Siegert" <siegert at sfu.ca>
To: "Mark Hahn" <hahn at physics.mcmaster.ca>
Cc: "Ru-Zhen Li" <r.li at qmul.ac.uk>; <beowulf at beowulf.org>
Sent: Saturday, October 01, 2005 3:37 AM
Subject: Re: [Beowulf] mpich2 complain about nodes that i dont use


> On Fri, Sep 30, 2005 at 09:47:46PM -0400, Mark Hahn wrote:
>> > I am using mpich2 on linux cluster, I kept having errors like the 
>> > following
>> >
>> > rank 14 in job 2  cn128_57798   caused collective abort of all ranks
>> >   exit status of rank 14: killed by signal 9
>>
>> signal 9 is sigkill (not segv or abrt, etc), and I'd be a bit surprised
>> if this happened other than by someone killing the process.
>
> I indeed was surprised when I saw that (signal 9) with one of our codes
> as well. In that case it turned out to be code that needed a larger
> stacksize than was permitted under the current settings (ulimit, etc.).
> Thus, if "ulimit -s" shows something like 8192 you may want to increase
> that and try again.
> I could imagine that something like this could also happen with code
> that has a memory leak and runs the system out of memory.
>
> - Martin
>
> -- 
> Martin Siegert
> Head, HPC at SFU
> WestGrid Site Manager
> Academic Computing Services                        phone: (604) 291-4691
> Simon Fraser University                            fax:   (604) 291-4242
> Burnaby, British Columbia                          email: siegert at sfu.ca
> Canada  V5A 1S6
> 



