[Beowulf] Newbie question on mpich2 installation

Saleem Hasan hasan at grant.phys.subr.edu
Mon Feb 7 18:15:25 PST 2005


Hello all,

I apologise for what may be a very simple issue, but it is giving me
trouble. I would really appreciate some advice.

To learn how to set up a cluster, I have installed mpich2 on a Linux
machine running Red Hat 8.0. I have a second machine, also running RH 8.0.
w2 is the master and w1 is the slave. I have installed mpich2 on w2
(/home/mpi) and used NFS to share /home with w1. I have also set up
passwordless ssh between w1 and w2.
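
For a two-node setup like this, each mpd has to be able to reach the other
node by name, so one thing worth ruling out is a hostname that does not
resolve, or that resolves to a loopback address. The following is only a
sketch in present-day Python 3, not part of MPICH2. The function name
check_hosts and its resolve parameter are made up for illustration, and
whether name resolution is actually the cause of the problem here is not
established.

```python
import ipaddress
import socket


def check_hosts(hosts, resolve=socket.gethostbyname):
    """Return {host: status}; status is 'ok', 'loopback', or 'unresolved'.

    check_hosts and the resolve parameter are illustrative names,
    not anything provided by MPICH2.
    """
    results = {}
    for host in hosts:
        try:
            addr = resolve(host)
        except OSError:
            # socket.gaierror is a subclass of OSError.
            results[host] = "unresolved"
            continue
        # A node name mapped to 127.x.x.x cannot be reached from the other node.
        if ipaddress.ip_address(addr).is_loopback:
            results[host] = "loopback"
        else:
            results[host] = "ok"
    return results
```

Calling check_hosts(["w1.maverick.net", "w2.maverick.net"]) on both w1 and
w2 would show whether either name is unresolved or loopback on that machine.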

I am able to bring up mpd on the local machine (w2) and do mpdtrace and
mpdallexit. I am following the installation procedure from the MPICH2
home.

I am unable to boot mpd on the slave. The first time I ran
mpdboot -n 2 -f /home/mpi/mpd.hosts,
I got a message that there was no mpd.conf file on w1 and that this could
be a reason for mpd not coming up on the slave. I then added an mpd.conf
(with a secretword) to /etc on the slave as well. Now I get a different
message:

[root@w2 mpich2-1.0]# mpdboot -n 2 -f /home/mpi/mpd.hosts
mpdboot_w2.maverick.net_0 (mpdboot 357): error trying to start mpd(boot)
at 1 w1.maverick.net; output:
mpdboot_w1_1 (err_exit 379): mpd failed to start correctly on w1
  reason: 1: invalid msg from mpd :{}:
mpdboot_w1_1 (err_exit 385):   contents of mpd logfile in /tmp:
     logfile for mpd with pid 1654
mpdboot_w2.maverick.net_0 (err_exit 379): mpd failed to start correctly on
w2.maverick.net

Even though the last line of the message says mpd failed to start
correctly on w2, mpdtrace still lists w2.
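
Regarding the mpd.conf added on the slave: as far as I understand the
MPICH2 documentation, mpd expects that file to be accessible only by its
owner and to carry the same secretword on every node. A minimal sketch of
a check for this is below. The helper name mpd_conf_problems is
hypothetical, and the "secretword" key name is an assumption about the
file format used by this MPICH2 version, so it is left as a parameter.

```python
import os
import stat


def mpd_conf_problems(path, key="secretword"):
    """Return a list of problems found with an mpd.conf file (empty if none).

    The key name is an assumption; pass a different one if your MPICH2
    version uses another spelling.
    """
    try:
        mode = os.stat(path).st_mode
    except OSError as exc:
        return ["cannot stat %s: %s" % (path, exc)]
    problems = []
    # Any group/other permission bit means the file is not private to its owner.
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        problems.append(
            "%s is accessible by group/others (mode %o)"
            % (path, stat.S_IMODE(mode))
        )
    with open(path) as handle:
        keys = [line.split("=", 1)[0].strip() for line in handle if "=" in line]
    if key not in keys:
        problems.append("no %s= line in %s" % (key, path))
    return problems
```

Running mpd_conf_problems("/etc/mpd.conf") as root on each node would
return an empty list if neither issue is present.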

The log file on w1 (slave) states the following:
logfile for mpd with pid 1654
w1_1060 failed ; cause: unable to obtain socket for rhs in ring
    traceback: [('/home/mpi/mpich2-install/bin/mpd.py', '1192',
'_enter_existing_ring'), ('/home/mpi/mpich2-install/bin/mpd.py', '173',
'_mpd_init'), ('/home/mpi/mpich2-install/bin/mpd.py', '1374', '?')]
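
The traceback in the logfile above is written as a Python list of
(file, line, function) tuples, so it can be turned into a readable call
chain with a few lines of Python. This is only a sketch: the helper name
call_chain is made up, and it assumes the log lists the innermost frame
first, which is how the entries above appear to be ordered.

```python
import ast

# Traceback text copied from the w1 logfile above, rejoined onto one line.
TRACEBACK = (
    "[('/home/mpi/mpich2-install/bin/mpd.py', '1192', '_enter_existing_ring'), "
    "('/home/mpi/mpich2-install/bin/mpd.py', '173', '_mpd_init'), "
    "('/home/mpi/mpich2-install/bin/mpd.py', '1374', '?')]"
)


def call_chain(traceback_text):
    """Return 'function (file:line)' strings, outermost call first."""
    frames = ast.literal_eval(traceback_text)
    # Assumption: the log lists the innermost frame first, so reverse it.
    return [
        "%s (%s:%s)" % (func, path, line)
        for path, line, func in reversed(frames)
    ]


for entry in call_chain(TRACEBACK):
    print(entry)
```

Read this way, the failure happens in _enter_existing_ring, called from
_mpd_init, that is, while the mpd on w1 is trying to join the ring started
on w2.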

Thank you very much.

Saleem Hasan







