[Beowulf] mpich2 complains about nodes that I don't use

Ru-Zhen Li r.li at qmul.ac.uk
Fri Sep 30 06:58:42 PDT 2005


Dear all,

I am using mpich2 on a Linux cluster, and I keep getting errors like the following:

rank 14 in job 2  cn128_57798   caused collective abort of all ranks
  exit status of rank 14: killed by signal 9

or

mpdrun_cn145: cannot connect to local mpd (/tmp/mpd2.console_lrz); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)

There are 160 nodes on the cluster. I use "mpdboot -n -f" to start the mpd daemons. Because there were always errors when I tried to boot every node, I listed only 64 nodes in the mpd.hosts file. The nodes named in the errors above are not in my mpd.hosts file, nor in the mpiexec command I use to launch my application.
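
One way to double-check that the nodes in the errors really are missing from mpd.hosts is to compare the two mechanically. The following is only a minimal sketch under some assumptions: that mpd identifiers in the messages have the form hostname_port (as in cn128_57798), that the "mpdrun_" prefix is followed by the hostname (as in mpdrun_cn145), and that each mpd.hosts line starts with a hostname, optionally followed by ":ncpus". The function names and the hosts cn001 and cn002 are hypothetical and used for illustration only.

```python
import re

# Assumption: mpd identifiers look like "cn128_57798" (hostname, underscore, port).
MPD_ID = re.compile(r"\b([A-Za-z][A-Za-z0-9.-]*)_\d+\b")
# Assumption: the mpdrun message prefix looks like "mpdrun_cn145".
MPDRUN_PREFIX = re.compile(r"\bmpdrun_([A-Za-z][A-Za-z0-9.-]*)")


def hosts_in_errors(error_text):
    """Return the set of hostnames mentioned in the error messages."""
    hosts = set(MPD_ID.findall(error_text))
    hosts.update(MPDRUN_PREFIX.findall(error_text))
    return hosts


def parse_hosts(hosts_text):
    """Return the set of hostnames listed in the contents of an mpd.hosts file."""
    hosts = set()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        first_token = line.split()[0]          # ignore anything after the host entry
        hosts.add(first_token.split(":", 1)[0])  # drop an optional ":ncpus" suffix
    return hosts


def unexpected_hosts(error_text, hosts_text):
    """Return hosts that appear in the errors but not in mpd.hosts, sorted."""
    return sorted(hosts_in_errors(error_text) - parse_hosts(hosts_text))


if __name__ == "__main__":
    errors = (
        "rank 14 in job 2  cn128_57798   caused collective abort of all ranks\n"
        "mpdrun_cn145: cannot connect to local mpd (/tmp/mpd2.console_lrz); "
        "possible causes:\n"
    )
    # Hypothetical hosts file contents; replace with the text of the real mpd.hosts.
    example_hosts = "cn001\ncn002:2\n"
    print(unexpected_hosts(errors, example_hosts))
```

Any host this reports is named in an error but absent from the hosts file. The comparison is by exact string, so a file that mixes short names and fully qualified names would need normalising first.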

Does anybody have any experience with this? Thanks a lot!

Best regards,

ruzhen