[Beowulf] mpich2 complain about nodes that i dont use

Ru-Zhen Li r.li at qmul.ac.uk
Fri Sep 30 06:58:42 PDT 2005


Dear all,

I am using MPICH2 on a Linux cluster, and I keep getting errors like the following:

rank 14 in job 2  cn128_57798   caused collective abort of all ranks
  exit status of rank 14: killed by signal 9

or

mpdrun_cn145: cannot connect to local mpd (/tmp/mpd2.console_lrz); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)

There are 160 nodes on the cluster. I used "mpdboot -n -f" to start MPI. Since there were always errors when I tried to boot every node, I defined only 64 nodes in the mpd.hosts file. The nodes named in the errors above are not in my mpd.hosts file, nor in the mpiexec command I used to run my application.
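
One way to double-check that the hosts in these errors really are outside the configured set is a small script like the sketch below. It is only an illustration, not part of MPICH2: the function names are hypothetical, the two regular expressions are guessed from just the two messages quoted above, and the hosts file is assumed to list one host per line with an optional ":ncpus" suffix.

```python
import re

# Assumed message shapes, inferred only from the two errors quoted above.
RANK_ABORT = re.compile(
    r"rank \d+ in job \d+\s+([\w.-]+?)_\d+\s+caused collective abort"
)
MPDRUN_FAIL = re.compile(r"mpdrun_([\w.-]+): cannot connect to local mpd")


def hosts_in_errors(log_text):
    """Return the set of host names mentioned in the known error lines."""
    hosts = set()
    for pattern in (RANK_ABORT, MPDRUN_FAIL):
        hosts.update(pattern.findall(log_text))
    return hosts


def parse_hosts_file(text):
    """Parse hosts-file text: one host per line, optional ':ncpus' (assumed)."""
    hosts = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if line:
            hosts.add(line.split(":", 1)[0].split()[0])
    return hosts


def unexpected_hosts(log_text, hosts_text):
    """Hosts that appear in error messages but not in the hosts file."""
    return sorted(hosts_in_errors(log_text) - parse_hosts_file(hosts_text))
```

Calling unexpected_hosts() with the captured error output and the contents of mpd.hosts returns the host names that show up in errors but were never listed, which would suggest that something other than the current mpd.hosts is deciding where ranks run.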

Does anybody have any experience with this? Thanks a lot!

Best regards,

ruzhen