[Beowulf] newbie question about mpich2 on heterogenous cluster

baenni at kiecks.de
Tue Mar 22 05:03:50 PST 2005


Dear List

I installed mpich2-1.0 on my little cluster (2 Linux nodes and 3 Solaris 
nodes). At first I worked only on the two Linux nodes, where the programs run 
without trouble. But when I try to include the Solaris nodes, i.e. 
when I run the programs on a heterogeneous cluster, the run ends in error 
messages. For some reason, the -arch parameter is not implemented in 
mpich2-1.0. 

Does anyone have experience with such problems? Can I run mpich2 on a 
heterogeneous cluster?

Thanks in advance for any help





mpiexec -n 1 -host shaw -path /home1/00117cfd/CFD_3D/example/PARALLEL/cpi _cpi : \
        -n 1 -host devienne -path /home1/00117cfd/CFD_3D/example/PARALLEL/cpi _cpi : \
        -n 1 -host gallay -path /export/home/baenni/example/PARALLEL/cpi _cpi : \
        -n 2 -host gallay1 -path /export/home/baenni/example/PARALLEL/cpi _cpi



aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(821): MPI_Bcast(buf=0x8145480, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(229):
MPIC_Send(48):
MPIC_Wait(308):
MPIDI_CH3_Progress_wait(207): an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(492):
connection_recv_fail(1728):
MPIDU_Socki_handle_read(590): connection closed by peer (set=0,sock=1)
aborting job:
Fatal error in MPI_Bcast: Internal MPI error!, error stack:
MPI_Bcast(821): MPI_Bcast(buf=1786e0, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(197):
MPIC_Recv(98):
MPIC_Wait(308):
MPIDI_CH3_Progress_wait(207): an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(849): [ch3:sock] received packet of unknown type (369098752)
rank 4 in job 19  shaw_33110   caused collective abort of all ranks
  exit status of rank 4: killed by signal 9
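
A side note on the "received packet of unknown type (369098752)" message in the error stack, offered as a guess rather than a diagnosis. The number happens to be 0x16000000 in hexadecimal, which is what a small integer (22) looks like when its four bytes are read in the opposite byte order. That would fit a byte-order mismatch between the Linux and Solaris nodes, but only under an assumption the post does not state: that the Solaris machines are big-endian (e.g. SPARC) and the Linux machines little-endian (e.g. x86). The helper name `swap32` below is made up for this sketch and is not part of mpich2.

```python
import struct


def swap32(value: int) -> int:
    """Return a 32-bit unsigned integer with its byte order reversed."""
    # Pack as big-endian, then reinterpret the same four bytes as little-endian.
    return struct.unpack("<I", struct.pack(">I", value))[0]


# Value copied from the "received packet of unknown type" line above.
unknown_type = 369098752

print(hex(unknown_type))     # 0x16000000
print(swap32(unknown_type))  # 22
```

If the assumption holds, the receiving rank is seeing a header field written by a node with the other byte order, which points to a data-representation problem between the two architectures rather than a fault in the cpi program itself.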


