[Beowulf] MPICH problem [was: (no subject)]

Reuti reuti at staff.uni-marburg.de
Fri May 6 13:57:20 PDT 2005


Hi,

if you compiled with just these flags, you will get:

a) ssh compiled in as the remote shell

b) no shared memory support.

To use more than one node, you need a passwordless login via ssh or rsh. To 
switch to rsh you could either run ./configure with -rsh=rsh, or set it at 
runtime with "export P4_RSHCOMMAND=rsh" before you start the program.
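
A quick way to verify the passwordless login before starting mpirun is 
sketched below. This is only an illustration: the function names check_node 
and check_nodes are made up here, and the node names in the usage comment are 
taken from your machine file. It uses the remote shell named in 
P4_RSHCOMMAND and falls back to ssh.

```shell
#!/bin/sh
# Remote shell to test: P4_RSHCOMMAND if set, otherwise ssh.
RSH="${P4_RSHCOMMAND:-ssh}"

# check_node HOST: run the trivial command "true" on HOST.
# (Hypothetical helper, not part of MPICH.)
check_node() {
    if [ "$RSH" = "ssh" ]; then
        # BatchMode=yes makes ssh fail instead of asking for a password.
        "$RSH" -o BatchMode=yes -o ConnectTimeout=5 "$1" true
    else
        "$RSH" "$1" true
    fi
}

# check_nodes HOST...: print one "ok" or "FAILED" line per host.
check_nodes() {
    for host in "$@"; do
        if check_node "$host" </dev/null >/dev/null 2>&1; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
        fi
    done
}

# Usage (after sourcing this file):
#   check_nodes node1 node2
```

Any node reported as FAILED would still prompt for a password (or is not 
reachable at all), and p4 will most likely not be able to start a process 
there.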

To get shared memory support, you have to pass --with-comm=shared to 
configure, otherwise the :2 in your .LINUX file will be ignored (and the file 
simply scanned twice). OTOH: with shared memory compiled in, you might get 
leftover shared memory segments and semaphores. With dual nodes, the speed 
gain may be small compared to the problems with leftovers, so it may be better 
not to use any shared memory in this case. Then each node has to appear twice 
in the .LINUX file (or in a custom machinefile) to avoid double scanning.
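
For the variant without shared memory, a machinefile with every node listed 
twice could be generated like this. It is a minimal sketch: the function name 
make_machinefile and the file name machines.dual are made up, and it assumes 
your nodes are really named node1 ... node32 as in your mail.

```shell
#!/bin/sh
# make_machinefile COUNT: print node1..nodeCOUNT, each name twice,
# i.e. one line per CPU of a dual-CPU node. (Hypothetical helper.)
make_machinefile() {
    count="$1"
    i=1
    while [ "$i" -le "$count" ]; do
        echo "node$i"
        echo "node$i"
        i=$((i + 1))
    done
}

# Usage (after sourcing this file):
#   make_machinefile 32 > machines.dual
```

The resulting file can then be handed over with mpirun's -machinefile option, 
e.g. "mpirun -np 4 -machinefile machines.dual cpi".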

But anyway: with a cluster of 32 nodes I'd suggest looking at a queuing 
system like Sun Grid Engine. Otherwise you might get jobs distributed just to 
the first nodes of the cluster, and adjusting a machinefile by hand each time 
is cumbersome.

Cheers - Reuti


Quoting maqsood at chep.pu.edu.pk:

> I am trying to set up MPICH-1.2.6 on a 32-node (dual CPU) cluster. I
> installed MPI under /usr/local/mpich-1.2.6 and followed this
> procedure:
> ./configure --with-device=ch_p4
> make
> 
> /util/machine.LINUX consists of
> node1:2
> node2:2
> .
> .
> node32:2
> 
> When I run
> mpirun -v -nolocal -np 1 cpi it gives the following output:
> 
> running /usr/local/mpich-1.2.6/bin/cpi on 1 LINUX ch_p4 processors
> Created /usr/local/mpich-1.2.6/bin/PI1529
> Process 0 of 1 on node1.chep.pu.edu
> pi is approximately 3.1415926544231341, Error is 0.0000000008333410
> wall clock time = 0.000666
> 
> but when I run on two or more nodes it gives the following error:
> 
> mpirun -v -nolocal -np 2 cpi
> running /usr/local/mpich-1.2.6/bin/cpi on 2 LINUX ch_p4 processors
> Created /usr/local/mpich-1.2.6/bin/PI1371
> rm_3765:  p4_error: rm_start: net_conn_to_listener failed: 32908
> p0_5964:  p4_error: Child process exited while making connection to remote
> process on node2: 0
> 
> ssh, nfs and nis are working fine.
> 
> please help me to solve this problem.
> 
> Maqsood Ahmed
> Assistant Professor
> Centre for High Energy Physics
> University of the Punjab