Scyld + myrinet mpich-gm?
keithu at parl.clemson.edu
Mon Feb 5 07:14:52 PST 2001
Hmmm... we have something similar, but not quite the same. We have a
master w/ 100base-T to the world, gigabit fiber to a 24-10/100 + 2-1000
switch and 16 slaves (not diskless) with 10/100 and gigabit interfaces.
We only have 16 ports on our gigabit switch and our master is a different
type of machine from the 16 slaves. We have successfully convinced the
machines to communicate over the gigabit exclusively while communicating
with the master over the 10/100. You do need to use the Scyld MPI though.
I seriously doubt that you will get another MPI running as is.
Anyway, what we did was:
after bringing the nodes up:
bpsh -a route add -host 192.168.1.1 eth0
bpsh -a route del default
bpsh -a modprobe sk98lin
then on each node:
bpsh <node> ifconfig eth1 up <nodes current IP>
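Looping that per-node step from the head is straightforward. A minimal sketch, assuming node numbers 0-3 and a hypothetical 192.168.1.10x addressing scheme (substitute your own node list and IPs); drop the echo to actually run the commands:

```shell
#!/bin/sh
# Bring up eth1 (gigabit) on each slave with its cluster IP.
# NODES and the 192.168.1.1xx scheme are assumptions for illustration.
NODES="0 1 2 3"
for n in $NODES; do
    ip="192.168.1.$((n + 100))"        # assumed per-node IP scheme
    echo "bpsh $n ifconfig eth1 up $ip"  # remove 'echo' to execute
done
```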
Then to run an MPI job that DOES NOT run on the head:
NO_INLINE_MPIRUN=true bpsh 0 mpiapp -p4pg /tmp/pgfile
where /tmp/pgfile is a p4 process group file.
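For reference, a ch_p4 process group file lists one entry per host as "hostname numprocs executable"; the first line names the starting node with a count of 0, since the initiating process is already running there. A hypothetical /tmp/pgfile for a four-node run (the node names n0-n3 and the /tmp/mpiapp path are assumptions; use whatever resolves on your cluster):

```
n0  0  /tmp/mpiapp
n1  1  /tmp/mpiapp
n2  1  /tmp/mpiapp
n3  1  /tmp/mpiapp
```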
This is a real sketchy config so don't expect too much support on it
just yet ;-)
On Sun, 4 Feb 2001, Dave Johnson wrote:
> I've gotten myself involved in bringing a small cluster up and
> into production. I'm learning as I go, with the help of the
> archives of this mailing list. Unfortunately the searchable
> archives at Supercomputer.org seem to be off line (I get internal
> server error), and out of date (the last messages seem to be from
> around May 2000).
> The current setup is one master with 100base-T to the world, gigabit
> fiber to a 16-10/100 + 2-1000 switch, and 12 diskless slaves with
> 10/100 and myrinet interfaces. The Scyld release of last Monday is
> up and running, and I can bpsh to my heart's content.
> I'm stuck at the point of trying to deploy MPI. Scyld supplies mpi-beowulf
> which does not appear to me to use bproc, and /usr/bin/mpirun and mpprun
> which do. I've built the mpich-gm from Myricom, but their mpirun command
> does not grok bpsh, and expects either rsh or ssh daemons on each slave.
> I've tried a number of approaches that start out looking like they might
> work, but have gotten stuck after a few hours down each cowpath.
> Here is a list of some of the snags (I've lost track of some others):
> bpsh is not a full-blown shell: it doesn't deal well with redirection or with
> changing directory before running a command, and in particular it can't be
> swapped in for rsh or ssh when configuring mpich (i.e. -rsh=bpsh).
> The master node is outside the myrinet, I haven't a clue how to get
> it to cooperate with the slaves over ethernet yet have the slaves
> use myrinet as much as possible.
> I tried hacking on the first test in mpich-1.2..4/examples/test
> (pt2pt/third) that you get when you do make testing or runtests -check.
> Tried to get it to use /usr/bin/mpirun. Had to get rid of -mvhome and
> -mvback args first, then tried to use bpsh to start up the mpirun on
> one node, hoping it could use GM to start up on the other slaves.
> After creating the directory in /var where it could create shm_beostat,
> I now get truckloads of errors:
> shmblk_open: Couldn't open shared memory file: /shm_beostat
> shmblk_open failed.
> I suppose these might be from the other nodes, expecting everyone is
> sharing /var, but I'm leery of nfs mounting all of the master's /var
> on each slave.
> I tried applying the Scyld patches against the 1.2.0 mpich sources to
> the 1.2..4 sources from Myricom, but most of them went into the mpid/ch_p4
> directory, which is not built when --with-device=ch_gm is specified.
> Then I thought I'd look into the mpprun sources, but I couldn't get
> them to build even before I started hacking on them... decided to look
> elsewhere for a while.
> Tried getting sshd2 up and running on a slave node. So far it insists
> on asking for my password and won't accept it at all.
> Has anyone got a working cluster anything like the one we're building?
> What did you have to do differently to make the various packages and
> drivers play nice with each other? Where did I go wrong?
> -- ddj
> Dave Johnson
> ddj at cascv.brown.edu
> Brown University TCASCV
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Keith Underwood Parallel Architecture Research Lab (PARL)
keithu at parl.clemson.edu Clemson University