BeoMPI doesn't work with more than 3 nodes

Carlos J. García Orellana carlos at nernet.unex.es
Wed Mar 28 01:09:25 PST 2001


Hello,

I want to use BeoMPI on our cluster, so I started with the examples.
At first it didn't work because I was using the wrong 'mpirun' script.
After that, the 'cpi' example works fine with 1, 2 or 3 processors.

However, when I try to use more nodes, it doesn't work. Why?

What is the right setup for working with BeoMPI?

Thanks.

Carlos.

P.S. Output of running 'cpi' with 3 and 4 nodes (-p4dbg 10):

[root@nereapc mpiex]# mpirun -np 3 ./cpi -p4dbg 10
10: xm_30944: (-) using procgroup file /proc/self/fd/3
10: p0_30944: (0.000007) Beowulf: using beowulf version of gethostbyname_p4
10: p0_30944: (0.001176) hostname in first line of procgroup is -1
10: p0_30944: (0.001228) hostname for first entry in proctable is -1
10: p0_30944: (0.001257) Beowulf: using beowulf version of gethostbyname_p4
10: p0_30944: (0.001421) Beowulf: using beowulf version of gethostbyname_p4
10: rm_30946: (-) Beowulf: using beowulf version of gethostbyname_p4
10: p0_30944: (0.050455) Beowulf: using beowulf version of gethostbyname_p4
10: rm_30948: (-) Beowulf: using beowulf version of gethostbyname_p4
10: p0_30944: (0.098959) Beowulf: using beowulf version of gethostbyname_p4
10: p0_30944: (0.101561) sent msg of type 1010101010 from 0 to 1 via socket 6
10: p0_30944: (0.101661) sent msg of type 1010101010 from 0 to 2 via socket 7
10: p0_30944: (0.101728) sent msg of type 1010101010 from 0 to 1 via socket 6
10: p0_30944: (0.101772) sent msg of type 1010101010 from 0 to 2 via socket 7
10: p0_30944: (0.101811) sent msg of type 1010101010 from 0 to 1 via socket 6
10: p0_30944: (0.101868) sent msg of type 1010101010 from 0 to 2 via socket 7
10: p0_30944: (0.141063) received type=1010101010, from=1
10: p0_30944: (0.141094) received type=1010101010, from=2
10: p0_30944: (0.141155) sent msg of type 1010101010 from 0 to 1 via socket 6
10: p0_30944: (0.141198) sent msg of type 1010101010 from 0 to 2 via socket 7
Process 0 on -1
10: p0_30944: (0.143268) sent msg of type 0 from 0 to 2 via socket 7
10: p0_30944: (0.143493) sent msg of type 0 from 0 to 1 via socket 6
Process 2 on 2
Process 1 on 1
10: p0_30944: (0.143752) received type=0, from=1
10: p0_30944: (0.143826) received type=0, from=2
pi is approximately 3.1416009869231249, Error is 0.0000083333333318
wall clock time = 0.000780
10: p0_30944: (0.143927) sent msg of type 0 from 0 to 2 via socket 7
10: p0_30944: (0.143973) sent msg of type 0 from 0 to 1 via socket 6
10: p0_30944: (0.144176) received type=0, from=2
10: p0_30944: (0.144249) sent msg of type 0 from 0 to 1 via socket 6
10: p0_30944: (0.144301) received type=0, from=1
10: p0_30944: (0.144346) sent msg of type 0 from 0 to 2 via socket 7
[root@nereapc mpiex]#

[root@nereapc mpiex]# mpirun -np 4 ./cpi -p4dbg 10
10: xm_30951: (-) using procgroup file /proc/self/fd/3
10: p0_30951: (0.000012) Beowulf: using beowulf version of gethostbyname_p4
10: p0_30951: (0.001121) hostname in first line of procgroup is -1
10: p0_30951: (0.001220) hostname for first entry in proctable is -1
10: p0_30951: (0.001265) Beowulf: using beowulf version of gethostbyname_p4
10: p0_30951: (0.001425) Beowulf: using beowulf version of gethostbyname_p4
10: p0_30951: (0.001529) Beowulf: using beowulf version of gethostbyname_p4
10: rm_30953: (-) Beowulf: using beowulf version of gethostbyname_p4
10: p0_30951: (0.050602) Beowulf: using beowulf version of gethostbyname_p4
10: rm_30955: (-) Beowulf: using beowulf version of gethostbyname_p4
10: p0_30951: (0.099172) Beowulf: using beowulf version of gethostbyname_p4
10: rm_30957: (-) Beowulf: using beowulf version of gethostbyname_p4
10: p0_30951: (0.147644) Beowulf: using beowulf version of gethostbyname_p4
10: p0_30951: (0.151243) sent msg of type 1010101010 from 0 to 1 via socket 6
10: p0_30951: (0.151315) sent msg of type 1010101010 from 0 to 2 via socket 7
10: p0_30951: (0.151384) sent msg of type 1010101010 from 0 to 1 via socket 6
10: p0_30951: (0.151449) sent msg of type 1010101010 from 0 to 2 via socket 7
10: p0_30951: (0.151493) sent msg of type 1010101010 from 0 to 1 via socket 6
10: p0_30951: (0.151557) sent msg of type 1010101010 from 0 to 2 via socket 7
p1_30953:  p4_error: Timeout in establishing connection to remote process: 0
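
For reading traces like the two above, a small script can tally the message traffic per rank. The following is only a minimal sketch and is not part of BeoMPI or MPICH: the function name `summarize` is made up for illustration, and the regular expressions assume nothing beyond the line shapes visible in the output above ("sent msg of type ... from A to B", "received type=..., from=B", "p4_error: ...").

```python
import re
from collections import Counter

# Patterns assume only the line shapes seen in the -p4dbg output above.
# The trailing socket number is not matched, so lines that were wrapped
# after "via socket" are still counted.
SENT = re.compile(r"sent msg of type (\d+) from (\d+) to (\d+)")
RECV = re.compile(r"received type=(\d+), from=(\d+)")
ERR = re.compile(r"p4_error: (.*)")


def summarize(log_text):
    """Return (sent, received, errors) for a p4 debug trace.

    sent:     Counter mapping destination rank -> number of messages sent
    received: Counter mapping source rank -> number of messages received
    errors:   list of p4_error message strings, in order of appearance
    """
    sent = Counter()
    received = Counter()
    errors = []
    for line in log_text.splitlines():
        match = SENT.search(line)
        if match:
            sent[int(match.group(3))] += 1
            continue
        match = RECV.search(line)
        if match:
            received[int(match.group(2))] += 1
            continue
        match = ERR.search(line)
        if match:
            errors.append(match.group(1).strip())
    return sent, received, errors


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python summarize_p4.py < cpi_np4.log
    sent, received, errors = summarize(sys.stdin.read())
    for rank in sorted(set(sent) | set(received)):
        print(f"rank {rank}: sent={sent[rank]} received={received[rank]}")
    for message in errors:
        print(f"p4_error: {message}")
```

Run on the -np 4 trace as shown above, this would count messages sent to ranks 1 and 2 only, nothing received, and one p4_error. That makes it easy to see that the excerpt contains no traffic to rank 3 before the timeout.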






