Scyld + myrinet mpich-gm?
Dave Johnson ddj at mookie.cis.brown.edu
Sat Feb 3 21:15:58 PST 2001
I've gotten myself involved in bringing a small cluster up and into production. I'm learning as I go, with the help of the archives of this mailing list. Unfortunately the searchable archives at Supercomputer.org seem to be off line (I get an internal server error) and out of date (the last messages seem to be from around May 2000).

The current setup is one master with 100base-T to the world, gigabit fiber to a 16-10/100 + 2-1000 switch, and 12 diskless slaves with 10/100 and myrinet interfaces. The Scyld release of last Monday is up and running, and I can bpsh to my heart's content.

I'm stuck at the point of trying to deploy MPI. Scyld supplies mpi-beowulf, which does not appear to me to use bproc, and /usr/bin/mpirun and mpprun, which do. I've built the mpich-gm from Myricom, but their mpirun command does not grok bpsh, and expects either rsh or ssh daemons on each slave.

I've tried a number of approaches that start out looking like they might work, but have gotten stuck after a few hours down each cowpath. Here is a list of some of the snags (I've lost track of some others):

- bpsh is not a full blown shell. It doesn't deal well with redirection or with changing directory before running a command, and in particular it can't be swapped for rsh or ssh when configuring mpich (i.e. -rsh=bpsh).

- The master node is outside the myrinet. I haven't a clue how to get it to cooperate with the slaves over ethernet yet have the slaves use myrinet as much as possible.

- I tried hacking on the first test in mpich-1.2..4/examples/test (pt2pt/third) that you get when you do make testing or runtests -check. Tried to get it to use /usr/bin/mpirun. Had to get rid of the -mvhome and -mvback args first, then tried to use bpsh to start up the mpirun on one node, hoping it could use GM to start up on the other slaves. After creating the directory in /var where it could create shm_beostat, I now get truckloads of errors:

    shmblk_open: Couldn't open shared memory file: /shm_beostat
    shmblk_open failed.

  I suppose these might be from the other nodes, expecting that everyone is sharing /var, but I'm leery of nfs mounting all of the master's /var on each slave.

- I tried applying the Scyld patches against the 1.2.0 mpich sources to the 1.2..4 sources from Myricom, but most of them went into the mpid/ch_p4 directory, which is not built when --with-device=ch_gm is specified.

- Then I thought I'd look into the mpprun sources, but I couldn't get them to build even before I started hacking on them... decided to look elsewhere for a while.

- Tried getting sshd2 up and running on a slave node. So far it insists on asking for my password and won't accept it at all.

Has anyone got a working cluster anything like the one we're building? What did you have to do differently to make the various packages and drivers play nice with each other? Where did I go wrong?

Thanks,
-- ddj
Dave Johnson
ddj at cascv.brown.edu
Brown University TCASCV
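To make the -rsh=bpsh snag above concrete: an rsh-style launcher is typically called as "host [-l user] [-n] command args...", whereas bpsh wants a node number followed by the command. The sketch below is only an illustration of that argument translation, not something taken from Scyld or Myricom. The function name rsh_to_bpsh, the node_map parameter, and the guess that a host name ends in its node number are all assumptions made for the example. It does nothing about the redirection or change-directory limitations.

```python
import re


def rsh_to_bpsh(argv, node_map=None):
    """Turn rsh-style arguments (host [-l user] [-n] command ...) into a bpsh argv.

    Hypothetical helper: node_map is an optional dict of host name -> node
    number; without it, the node number is guessed from trailing digits in
    the host name (an assumption, not a documented convention).
    """
    args = list(argv)
    if not args:
        raise ValueError("expected a host name")
    host = args.pop(0)

    # Discard rsh options that precede the command; "-l" also takes a user name.
    while args and args[0].startswith("-"):
        opt = args.pop(0)
        if opt == "-l" and args:
            args.pop(0)

    if not args:
        raise ValueError("expected a command after the host name")

    if node_map and host in node_map:
        node = node_map[host]
    else:
        match = re.search(r"(\d+)$", host)
        if match is None:
            raise ValueError("cannot work out a node number from %r" % host)
        node = int(match.group(1))

    # Assumed bpsh calling convention: bpsh <node number> <command> [args...]
    return ["bpsh", str(node)] + args
```

A small wrapper script could pass its own command-line arguments to this function and exec the result, so that mpirun sees something that behaves like rsh. Whether the mpirun script hands over a plain command or a quoted shell string (for example with a cd in front) would need checking first, since the latter would not survive this translation.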
