[Beowulf] MPICH problem [was: (no subject)]
Reuti reuti at staff.uni-marburg.de
Fri May 6 13:57:20 PDT 2005
- Previous message: [Beowulf] (no subject)
- Next message: [Beowulf] Gen - 1 Clusters
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hi,

if you compiled with just these flags, you will a) get ssh compiled in and b) have no shared memory support. What you need in order to use more than one node is a passwordless login via ssh or rsh. To switch to rsh you could ./configure -rsh=rsh, or at runtime "export P4_RSHCOMMAND=rsh" before you start a program.

To get shared memory support, you have to use --with-comm=shared during configure, otherwise the :2 in your .LINUX file will be ignored (and the file simply scanned twice).

OTOH: with shared memory compiled in you might get shared memory and semaphore leftovers. With dual nodes, the speed impact may be low compared to the problems with leftovers, and it may be better not to use any shared memory in this case. Then each node has to appear twice in the .LINUX file (or in a custom machinefile) to avoid double scanning.

But anyway: with a cluster of 32 nodes I'd suggest looking for a queuing system like SUN GridEngine. Otherwise you might get jobs distributed just to the first nodes of the cluster. And adjusting a machinefile by hand each time is cumbersome.

Cheers - Reuti

Quoting maqsood at chep.pu.edu.pk:

> I am trying to setup MPICH-1.2.6 on a 32 nodes (dual cpu) cluster. I
> installed MPI under /usr/local/mpich-1.2.6 and followed following
> procedure.
> ./configure --with-device=ch_p4
> make
>
> /util/machine.LINUX consist of
> node1:2
> node2:2
> .
> .
> node32:2
>
> When I run
> mpirun -v -nolcal -np 1 cpi it gives following output
>
> running /usr/local/mpich-1.2.6/bin/cpi on 1 LINUX ch_p4 processors
> Created /usr/local/mpich-1.2.6/bin/PI1529
> Process 0 of 1 on node1.chep.pu.edu
> pi is approximately 3.1415926544231341, Error is 0.0000000008333410
> wall clock time = 0.000666
>
> but when I run on two or more nodes its give following error.
>
> mpirun -v -nolocal -np 2 cpi
> running /usr/local/mpich-1.2.6/bin/cpi on 2 LINUX ch_p4 processors
> Created /usr/local/mpich-1.2.6/bin/PI1371
> rm_3765: p4_error: rm_start: net_conn_to_listener failed: 32908
> p0_5964: p4_error: Child process exited while making connection to remote
> process on node2: 0
>
> ssh, nfs and nis are working fine.
>
> please help me to solve this problem.
>
> Maqsood Ahmed
> Assistant Professor
> Centre for High Energy Physics
> University of the Punjab
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
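
For the "each node appears twice" layout suggested in the reply (no shared memory, dual-CPU nodes), a machinefile could be generated rather than typed by hand. This is only a minimal sketch: the node names follow the node1 to node32 pattern from the quoted message, and the file name machines.dual is a made-up example, not something from the thread.

```shell
#!/bin/sh
# Sketch: write a machinefile that lists every node twice,
# one line per CPU, instead of using the "nodeN:2" notation.
# MACHINEFILE is a hypothetical name; NODES=32 matches the quoted cluster size.
MACHINEFILE=machines.dual
NODES=32

: > "$MACHINEFILE"            # create or truncate the file
i=1
while [ "$i" -le "$NODES" ]; do
    echo "node$i" >> "$MACHINEFILE"   # first CPU of node i
    echo "node$i" >> "$MACHINEFILE"   # second CPU of node i
    i=$((i + 1))
done
```

Assuming your mpirun accepts the -machinefile option, a file like this could then be passed as "mpirun -machinefile machines.dual -np 4 cpi", or its contents copied into the .LINUX file.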
