[Beowulf] MPICH problem [was: (no subject)]
reuti at staff.uni-marburg.de
Fri May 6 13:57:20 PDT 2005
If you compiled with just these flags, you will:
a) get ssh compiled in as the remote shell command
b) get no shared memory support.
What you need in order to use more than one node is a passwordless login via
ssh or rsh. To switch to rsh you could run ./configure with -rsh=rsh, or at
runtime "export P4_RSHCOMMAND=rsh" before you start a program.
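As a rough sketch of how one might check the passwordless login before
starting mpirun: the helper name check_login is made up for illustration, and
the ssh options assume OpenSSH. It only tests whether a trivial command runs
on the remote host without asking for anything.

```shell
# Hypothetical helper (not part of MPICH): succeed if remote shell "$1"
# can run a trivial command on host "$2" without prompting.
check_login() {
    rsh_cmd=$1
    host=$2
    if [ "$rsh_cmd" = ssh ]; then
        # BatchMode makes ssh fail instead of asking for a password
        # (assumes OpenSSH).
        ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true </dev/null
    else
        # Any other remote shell command, e.g. rsh.
        "$rsh_cmd" "$host" true </dev/null
    fi
}
```

It could be used like this, with node2 taken from the error message:

    check_login ssh node2 && echo "passwordless ssh to node2 works"
    export P4_RSHCOMMAND=rsh    # only if you want rsh at runtime instead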
To get shared memory support, you have to pass --with-comm=shared to
configure; otherwise the :2 in your .LINUX file will be ignored (and the file
simply scanned twice). OTOH: with shared memory compiled in, you might get
leftover shared memory segments and semaphores. With dual-CPU nodes, the speed
gain may be small compared to the problems with leftovers, so it may be better
not to use shared memory at all in this case. Then each node has to appear
twice in the .LINUX file (or in a custom machinefile) to avoid double scanning.
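To illustrate the two machinefile layouts, here is a small sketch.
make_machinefile is a made-up helper, not part of MPICH, and listing the
duplicate entries directly after each other is my assumption about the order.

```shell
# With shared memory:    ./configure --with-device=ch_p4 --with-comm=shared
# Without shared memory: ./configure --with-device=ch_p4
#
# Hypothetical helper: read host names (one per line) on stdin and print
# a machinefile for dual-CPU nodes.
#   mode "shared": one "host:2" line per node (needs --with-comm=shared)
#   any other mode: each host listed twice
make_machinefile() {
    mode=$1
    while read -r host; do
        # Skip empty lines in the input.
        [ -n "$host" ] || continue
        if [ "$mode" = shared ]; then
            printf '%s:2\n' "$host"
        else
            printf '%s\n%s\n' "$host" "$host"
        fi
    done
}
```

For example, for a custom machinefile (the file name machines is arbitrary):

    printf 'node1\nnode2\n' | make_machinefile plain > machines
    mpirun -np 4 -machinefile machines cpi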
But anyway: with a cluster of 32 nodes I'd suggest looking at a queuing
system like SUN GridEngine. Otherwise you might get jobs distributed just to
the first nodes of the cluster. And adjusting a machinefile by hand each time
is tedious.
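With a queuing system the machinefile is generated per job. A minimal sketch
of what a job script could look like, under these assumptions: the site has a
parallel environment (called "mpich" here, the name is site-specific) whose
start script writes the granted hosts to $TMPDIR/machines. print_job_script is
a made-up helper that just prints such a script.

```shell
# Hypothetical helper: print a Grid Engine job script for an MPICH ch_p4 run.
#   $1 = number of slots to request, $2 = program to start
# Assumes a parallel environment named "mpich" that creates $TMPDIR/machines.
print_job_script() {
    slots=$1
    program=$2
    cat <<EOF
#!/bin/sh
#\$ -N mpich_job
#\$ -cwd
#\$ -pe mpich $slots
# NSLOTS and TMPDIR are set by the queuing system when the job runs.
mpirun -np \$NSLOTS -machinefile \$TMPDIR/machines $program
EOF
}
```

The output would then be saved and submitted, for example:

    print_job_script 4 ./cpi > cpi_job.sh
    qsub cpi_job.sh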
Cheers - Reuti
Quoting maqsood at chep.pu.edu.pk:
> I am trying to set up MPICH-1.2.6 on a 32-node (dual CPU) cluster. I
> installed MPI under /usr/local/mpich-1.2.6 and followed following
> ./configure --with-device=ch_p4
> /util/machine.LINUX consists of
> When I run
> mpirun -v -nolocal -np 1 cpi it gives the following output
> running /usr/local/mpich-1.2.6/bin/cpi on 1 LINUX ch_p4 processors
> Created /usr/local/mpich-1.2.6/bin/PI1529
> Process 0 of 1 on node1.chep.pu.edu
> pi is approximately 3.1415926544231341, Error is 0.0000000008333410
> wall clock time = 0.000666
> but when I run on two or more nodes it gives the following error.
> mpirun -v -nolocal -np 2 cpi
> running /usr/local/mpich-1.2.6/bin/cpi on 2 LINUX ch_p4 processors
> Created /usr/local/mpich-1.2.6/bin/PI1371
> rm_3765: p4_error: rm_start: net_conn_to_listener failed: 32908
> p0_5964: p4_error: Child process exited while making connection to remote
> process on node2: 0
> ssh, nfs and nis are working fine.
> please help me to solve this problem.
> Maqsood Ahmed
> Assistant Professor
> Centre for High Energy Physics
> University of the Punjab