Scyld + myrinet mpich-gm?

Dave Johnson ddj at cascv.brown.edu
Thu Feb 22 08:52:30 PST 2001


In the nearly three weeks since my posting of Feb 4, I have managed to
sort out most of the problems (including new ones that came up along the
way) with getting MPICH-GM jobs started on the diskless slave nodes using
bproc/bpsh.  I still wish searchable archives of this and other mailing
lists were available.  Right now I'm looking for pointers on tuning the
network interface (the gigabit card on the master is showing a fair number
of TX drops).

I will insert a few notes below as to what went on....

On Sun, Feb 04, 2001 at 12:15:58AM -0500, Dave Johnson wrote:
> I've gotten myself involved in bringing a small cluster up and
> into production.  I'm learning as I go, with the help of the
> archives of this mailing list.  Unfortunately the searchable
> archives at Supercomputer.org seem to be off line (I get internal
> server error), and out of date (the last messages seem to be from
> around May 2000).
> 
> The current setup is one master with 100base-T to the world, gigabit
> fiber to a 16-10/100 + 2-1000 switch, and 12 diskless slaves with
> 10/100 and myrinet interfaces.  The Scyld release of last Monday is
> up and running, and I can bpsh to my heart's content.
> 
> I'm stuck at the point of trying to deploy MPI.  Scyld supplies mpi-beowulf
> which does not appear to me to use bproc, and /usr/bin/mpirun and mpprun
> which do.  I've built the mpich-gm from Myricom, but their mpirun command
> does not grok bpsh, and expects either rsh or ssh daemons on each slave.
> 

I was clearly confused here -- the software in /usr/mpi-beowulf is linked
against the bproc libraries, but the /usr/mpi-beowulf/bin/mpirun script
is generic mpich.
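
An easy way to see which pieces actually use bproc is ldd (the binary
names under /usr/mpi-beowulf/bin will vary, so this just scans the whole
directory):

	[master]$ cd /usr/mpi-beowulf/bin
	[master]$ for f in *; do ldd $f 2>/dev/null | grep -q bproc && echo $f; done

Anything printed is linked against the bproc libraries; the mpirun script,
being a shell script, never shows up.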

The /usr/bin/mpirun binary is intended to replace the script, but there are
some limitations, some of which were mentioned before:
	- the script version handles the case where "." is not in $PATH,
	  but the binary mpirun doesn't.
	- the binary mpirun doesn't ignore -mvhome and -mvback.
	- the error in this case is misleading -- instead of complaining about
	  unrecognized options, it says (see the example after this list)
		"Failed to exec target program: No such file or directory".
	- I tried to get the mpprun SRPM to build properly, but gave up.
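
For example, an invocation along these lines trips over -mvhome and dies
with the misleading message (the other arguments here are only illustrative):

	[master]$ /usr/bin/mpirun -np 2 -mvhome ./third
	Failed to exec target program: No such file or directory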

> I've tried a number of approaches that start out looking like they might
> work, but have gotten stuck after a few hours down each cowpath.
> 
> Here is a list of some of the snags (I've lost track of some others):
> 
> bpsh is not a full-blown shell: it doesn't deal well with redirection or
> with changing directory before running a command, and in particular it
> can't be swapped in for rsh or ssh when configuring mpich (i.e. -rsh=bpsh).

Bproc's bpsh does have the interesting feature that the current working
directory on the master is inherited by the slave, as long as it exists on
the slave.  This made it possible to hack mpirun.ch_gm.in to use bpsh.
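
For example (the directory just has to exist in the slave's file system
for this to work):

	[master]$ cd /tmp
	[master]$ bpsh 0 pwd
	/tmp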

> 
> The master node is outside the myrinet; I haven't a clue how to get
> it to cooperate with the slaves over ethernet yet have the slaves
> use myrinet as much as possible.

Starting from the suggestions I received, I was able to do this.

The master's gigabit interface is connected to a switch with
16 RJ45 10/100 ports and 2 fiber gigabit ports.  The 10/100/1000
net is named beonet (192.168.1.0) and the myrinet is 192.168.2.0.

The master's address on beonet (eth1) is 192.168.1.1.
For the master I added the line:

eth1 net myrinet netmask 255.255.255.0

to /etc/sysconfig/static-routes.
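
On a stock Red Hat init, that entry gets turned into roughly the following
route command at network start (it assumes "myrinet" resolves as a network
name, e.g. from an entry in /etc/networks):

	/sbin/route add -net myrinet netmask 255.255.255.0 eth1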

For the slaves, I added a hook in the /etc/beowulf/node_up script to
call two of my new scripts, setup_local and setup_myrinet.  setup_local 
computes IP addresses from the node number, sets the proper hostname,
creates a slave version of the resolv.conf file, changes the IP routing
to use myrinet GM-IP whenever possible, and sets some network parameters
via /proc/sys/net/core.  setup_myrinet is similar, but it first loads the
GM driver and brings up the myri0 interface.  It also starts the GM mapper
on one of the nodes.
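
For the curious, setup_myrinet boils down to something like the sketch
below; the module name, mapper path, and addressing scheme here are
placeholders rather than my exact script:

	#!/bin/sh
	# setup_myrinet <node-number> -- called from the node_up hook
	NODE=$1

	# load the Myricom GM driver on the slave
	modprobe gm || exit 1

	# bring up GM-IP on myri0; addresses follow the node number
	ifconfig myri0 192.168.2.$((100 + NODE)) netmask 255.255.255.0 up

	# start the GM mapper on exactly one node so the fabric gets mapped
	if [ "$NODE" -eq 0 ]; then
		/usr/gm/sbin/mapper /usr/gm/etc/gm/active.args &
	fi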

In the end, the master's routes look like:
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
master.beonet   *               255.255.255.255 UH    0      0        0 eth1
128.148.160.xx  *               255.255.255.255 UH    0      0        0 eth0
myrinet         *               255.255.255.0   U     0      0        0 eth1
beonet          *               255.255.255.0   U     0      0        0 eth1
128.148.160.0   *               255.255.255.0   U     0      0        0 eth0
loopback        *               255.0.0.0       U     0      0        0 lo
default         128.148.160.yy  0.0.0.0         UG    0      0        0 eth0

The slaves' routing tables look like:
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
master.beonet   *               255.255.255.255 UH    0      0        0 eth0
myrinet         *               255.255.255.0   U     0      0        0 myri0
beonet          *               255.255.255.0   U     0      0        0 myri0
loopback        *               255.0.0.0       U     0      0        0 lo
default         master.beonet   0.0.0.0         UG    0      0        0 eth0

and an /etc/resolv.conf file like:
domain cfm.brown.edu
nameserver 192.168.1.1
search beonet cfm.brown.edu 

> 
> I tried hacking on the first test in mpich-1.2..4/examples/test
> (pt2pt/third) that you get when you do make testing or runtests -check.
> Tried to get it to use /usr/bin/mpirun.  Had to get rid of -mvhome and
> -mvback args first, then tried to use bpsh to start up the mpirun on
> one node, hoping it could use GM to start up on the other slaves.
> After creating the directory in /var where it could create shm_beostat,
> I now get truckloads of errors:
> shmblk_open: Couldn't open shared memory file: /shm_beostat
> shmblk_open failed.
> 
> I suppose these might be from the other nodes, which expect everyone to
> be sharing /var, but I'm leery of nfs mounting all of the master's /var
> on each slave.

What worked here was to make sure "." was on my path, hack out the -mvhome
and -mvback arguments from the runtests script, ignore bpsh, and use
/usr/bin/mpirun instead of ../../bin/mpirun.
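
Concretely, something like this from the top of the MPICH-GM tree (after
editing runtests to drop those two arguments and to call /usr/bin/mpirun):

	[master]$ cd examples/test
	[master]$ PATH=.:$PATH ./runtests -check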

> 
> I tried applying the Scyld patches against the 1.2.0 mpich sources to
> the 1.2..4 sources from Myricom, but most of them went into the mpid/ch_p4
> directory, which is not built when --with-device=ch_gm is specified.
> 
> Then I thought I'd look into the mpprun sources, but I couldn't get
> them to build even before I started hacking on them... decided to look
> elsewhere for a while.
> 
> Tried getting sshd2 up and running on a slave node.  So far it insists
> on asking for my password and won't accept it at all.

Having tried all these blind alleys, I concentrated on mpich-gm, and
stopped playing with /usr/bin/mpirun.  As it worked out, I had to make
changes to only one file, mpirun.ch_gm.in, and then I was able to run
the scripts in examples/test and examples/pertest.  The patch is attached
at the end of this message.
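
To apply it against the Myricom tree (assuming the patch gets saved out of
this mail as, say, mpirun-bpsh.patch):

	[master]$ cd mpich-1.2..4
	[master]$ patch -p0 < mpirun-bpsh.patch
	[master]$ ./configure --with-device=ch_gm ...	# regenerates mpirun.ch_gm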

Hope this is helpful to somebody out there.  Thanks for the tips that
got me going again.

	-- ddj

	Dave Johnson
	ddj at cascv.brown.edu
-------------- next part --------------
--- mpid/ch_gm/mpirun.ch_gm.in.orig	Mon Aug 21 19:46:18 2000
+++ mpid/ch_gm/mpirun.ch_gm.in	Thu Feb 15 15:21:03 2001
@@ -37,6 +37,7 @@
 		if ($_ eq $host) {
 		   exec($cmd_ln);
 	        } else {
+		   s/.*\D0*(\d+)$/$1/ if $bpsh;
 		   exec($rsh,'-n',$_,$cmd_ln);
 		}
            }	
@@ -63,6 +64,7 @@
 $SIG{'QUIT'} = 'cleanup';
 
 $rsh="#RSHCOMMAND#";
+$bpsh = ($rsh =~ m|/?bpsh$|);
 $host = $ENV{'HOST'} || `uname -n`;
 chomp $host;
 $display=$ENV{'DISPLAY'};
@@ -262,6 +264,7 @@
     $_ = read_line;
     die "bad line in $gmpifile: $_" unless /^([^\s]*)\s+(\d+)/;
     $mach[$i] = $1;
+    $mach[$i] = $host if ($1 eq "master");
 #    $node_id[$i] = $2;
     $port_id[$i] = $2;
     $board_id[$i] = 0;
@@ -335,9 +338,10 @@
 	  -d "$dir" or mkdir("$dir",0777) or die "cannot make directory $dir\n"
 	}
         $gmpi_opts = " GMPI_OPTS=m$lnode,n$nbnode ";
+	$cmdpref = "cd $dir;$mget env $varenv $gmpi_opts";
 	if (defined($debug[$lnode])) {
 	    my $cmd = $argv{$lnode}->[0];
-	    $cmdline = "cd $dir;$mget env $varenv $gmpi_opts xterm -e gdb $cmd $mrel";
+	    $cmdsuff = "xterm -e gdb $cmd $mrel";
 	} elsif ($tview){
 #	    my $cmd = "@{$argv{$lnode}}";
 	    #
@@ -354,16 +358,17 @@
 		for ($j=1;$j<=$numargs;$j++) {
 		    $cmdLineArgs = $cmdLineArgs . " $argv{$lnode}->[$j]";
 		}
-		$cmdline = "cd $dir;$mget env $varenv $gmpi_opts $totalview $cmd -a -mpichtv $cmdLineArgs";
+		$cmdsuff = "$totalview $cmd -a -mpichtv $cmdLineArgs";
 	    } else {
 		my $cmd = "@{$argv{$lnode}}";
-		$cmdline = "cd $dir;$mget env $varenv $gmpi_opts $cmd -mpichtv $mrel";
+		$cmdsuff = "$cmd -mpichtv $mrel";
 	    }
 
       	} else {
 	    my $cmd = "@{$argv{$lnode}}";
-	    $cmdline = "cd $dir;$mget env $varenv $gmpi_opts $cmd $mrel";
+	    $cmdsuff = "$cmd $mrel";
 	}
+	$cmdline = $cmdpref . $cmdsuff;
 
 	print STDERR "starting on $_: $cmdline\n" if ($verbose || !$doit);
 #	print "starting on $_: $cmdline\n";
@@ -371,11 +376,16 @@
 	    if ($_ eq $host) {
 		exec($cmdline);
 	    } else {
+		if ($bpsh) {
+		    s/.*\D0*(\d+)$/$1/;
+		    $cmdline = $cmdpref . "$rsh -n $_ " . $cmdsuff;
+		    exec($cmdline);
+		} else {
 		exec($rsh,'-n',$_,$cmdline);
+		}
 	    }
 	    die "$rsh $_ $argv{$lnode}->[0]:$!\n" 
-	    }
-	else {
+	} else {
 	    exit 0;
 	}
     }
-------------- next part --------------
The first change, after the original line 40, is to convert hostnames such
as node01 or slave-20 into just the numerical suffix, without any leading zeros.
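
For the record, the substitution is s/.*\D0*(\d+)$/$1/, which is easy to
sanity-check from the shell:

	[master]$ echo node01 | perl -pe 's/.*\D0*(\d+)$/$1/'
	1
	[master]$ echo slave-20 | perl -pe 's/.*\D0*(\d+)$/$1/'
	20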

The new line of code after line 65 sets the $bpsh flag based on whether the
tail of $rsh matches "bpsh".  This made the later changes simpler and more
readable.

The addition after line 264 is a gross hack to deal with the fact that the
master node is multi-homed: `uname -n` gives "trapeza", but 192.168.1.1 maps
back to "master" in reverse DNS lookups.

The new code after line 337 sets the prefix $cmdpref, which had been the
same for all the cases that follow.  The original line 340 now just sets
the suffix part.  I haven't tried any debugger on the cluster yet... these
changes would be necessary to get a debugger started via bpsh, but are
probably not enough by themselves.  Lines 357, 360, 365, and the line
following 367 are more of the same.

The remaining changes after line 373 rearrange the command line if $bpsh
is set, and again the slave hostnames are translated into node numbers.
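
The upshot is that for a remote rank the generated command ends up looking
something like this (node number and paths here are illustrative):

	cd /home/ddj/test;env GMPI_OPTS=m3,n12 bpsh -n 5 ./third

The cd and env now run on the master, and bproc carries the working
directory and environment along when bpsh migrates the process to the
slave, which is why the prefix no longer needs to be part of the remote
command itself.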


