[Beowulf] MPICH-1.2.5 hangs on 16 node cluster
Sreenivasulu Pulichintala sreenivasulu at topspin.com
Fri Nov 19 01:07:18 PST 2004
Hi,

I see some strange behavior in the MPICH stack when running on a 16 node cluster. The job deadlocks and hangs. Attaching gdb to the processes shows the following stacks.

------ On machine 1 ------
#0 0x0000000041efb858 in poll_rdma_buffer ()
#1 0x0000000041efd2cb in viutil_spinandwaitcq ()
#2 0x0000000041efba1e in MPID_DeviceCheck ()
#3 0x0000000041f0a36b in MPID_RecvComplete ()
#4 0x0000000041f02fc4 in MPI_Waitall ()
#5 0x0000000041eee8fc in MPI_Sendrecv ()
#6 0x0000000041ef460e in intra_Allreduce ()
#7 0x0000000041eec62c in MPI_Allreduce ()
#8 0x0000000041eeeab9 in mpi_allreduce_ ()
#9 0x00000000401ae133 in dynai_ ()
#10 0x0000000040006d08 in frame_dummy ()

------ Process 2 ------
#0 0x0000000041f018f8 in smpi_net_lookup ()
#1 0x0000000041f0188b in MPID_SMP_Check_incoming ()
#2 0x0000000041efd2b6 in viutil_spinandwaitcq ()
#3 0x0000000041efba1e in MPID_DeviceCheck ()
#4 0x0000000041f0a36b in MPID_RecvComplete ()
#5 0x0000000041f02fc4 in MPI_Waitall ()
#6 0x0000000041eee8fc in MPI_Sendrecv ()
#7 0x0000000041ef460e in intra_Allreduce ()
#8 0x0000000041eec62c in MPI_Allreduce ()
#9 0x0000000041eeeab9 in mpi_allreduce_ ()
#10 0x00000000401ae133 in dynai_ ()
#11 0x0000000040006d08 in frame_dummy ()

------ On node 2 ------
#0 0x0000000041efb877 in poll_rdma_buffer ()
#1 0x0000000041efd2cb in viutil_spinandwaitcq ()
#2 0x0000000041efba1e in MPID_DeviceCheck ()
#3 0x0000000041f0a36b in MPID_RecvComplete ()
#4 0x0000000041f09ead in MPID_RecvDatatype ()
#5 0x0000000041f03569 in MPI_Recv ()
#6 0x0000000041eef42d in mpi_recv_ ()
#7 0x0000000041c0b153 in remdupslave_ ()
#8 0x000000000000cf6b in ?? ()
#9 0x000000000000c087 in ?? ()
#10 0x000000000002f4b4 in ?? ()
#11 0x000000000000c503 in ?? ()
#12 0x000000000000c575 in ?? ()
#13 0x000000000000040c in ?? ()
#14 0x00000000401ae313 in dynai_ ()
#15 0x0000000040006d08 in frame_dummy ()

------ 2nd process ------
#0 0x0000000041efd2cb in viutil_spinandwaitcq ()
#1 0x0000000041efba1e in MPID_DeviceCheck ()
#2 0x0000000041f0a36b in MPID_RecvComplete ()
#3 0x0000000041f02fc4 in MPI_Waitall ()
#4 0x0000000041eee8fc in MPI_Sendrecv ()
#5 0x0000000041ef460e in intra_Allreduce ()
#6 0x0000000041eec62c in MPI_Allreduce ()
#7 0x0000000041eeeab9 in mpi_allreduce_ ()
#8 0x00000000401ae133 in dynai_ ()
#9 0x0000000040006d08 in frame_dummy ()

------

The processes on the other machines appear to be stuck in the MPI_Allreduce stack.

Has anyone experienced a similar problem? Any help in this regard is highly appreciated.

Thanks
Sree
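With 16 nodes' worth of backtraces, it can help to reduce each one to the MPI call the application actually entered and then count how many processes sit in each call. The following is only a rough sketch of that idea, not part of MPICH or gdb: the function name `outermost_mpi_call`, the `TRACES` dictionary, and the "trace N" labels are made up for illustration, and the frames are excerpts of the four stacks above in the order they were posted. It assumes gdb's usual frame format and treats the last `MPI_` frame in a trace (the one closest to the application) as the entry point.

```python
import re
from collections import Counter

# Matches gdb frames such as "#4 0x0000000041f02fc4 in MPI_Waitall ()".
FRAME_RE = re.compile(r"#\d+\s+0x[0-9a-fA-F]+\s+in\s+(\S+)\s+\(\)")


def outermost_mpi_call(backtrace):
    """Return the MPI_* frame closest to the application code, or None.

    gdb lists frames from innermost (#0) outward, so the last MPI_ match
    in the text is the call the application made. Device-level MPID_*
    frames and Fortran wrappers like mpi_recv_ are deliberately skipped.
    """
    names = FRAME_RE.findall(backtrace)
    mpi_frames = [name for name in names if name.startswith("MPI_")]
    return mpi_frames[-1] if mpi_frames else None


# Hypothetical labels; frames are excerpts of the stacks posted above.
TRACES = {
    "trace 1": """
        #3 0x0000000041f0a36b in MPID_RecvComplete ()
        #4 0x0000000041f02fc4 in MPI_Waitall ()
        #5 0x0000000041eee8fc in MPI_Sendrecv ()
        #6 0x0000000041ef460e in intra_Allreduce ()
        #7 0x0000000041eec62c in MPI_Allreduce ()
    """,
    "trace 2": """
        #4 0x0000000041f0a36b in MPID_RecvComplete ()
        #5 0x0000000041f02fc4 in MPI_Waitall ()
        #6 0x0000000041eee8fc in MPI_Sendrecv ()
        #7 0x0000000041ef460e in intra_Allreduce ()
        #8 0x0000000041eec62c in MPI_Allreduce ()
    """,
    "trace 3": """
        #3 0x0000000041f0a36b in MPID_RecvComplete ()
        #4 0x0000000041f09ead in MPID_RecvDatatype ()
        #5 0x0000000041f03569 in MPI_Recv ()
        #6 0x0000000041eef42d in mpi_recv_ ()
        #7 0x0000000041c0b153 in remdupslave_ ()
    """,
    "trace 4": """
        #2 0x0000000041f0a36b in MPID_RecvComplete ()
        #3 0x0000000041f02fc4 in MPI_Waitall ()
        #4 0x0000000041eee8fc in MPI_Sendrecv ()
        #5 0x0000000041ef460e in intra_Allreduce ()
        #6 0x0000000041eec62c in MPI_Allreduce ()
    """,
}

summary = Counter(outermost_mpi_call(text) for text in TRACES.values())

if __name__ == "__main__":
    for label, text in TRACES.items():
        print(label, "->", outermost_mpi_call(text))
    print(dict(summary))
```

On these four excerpts the summary comes out as three processes in MPI_Allreduce and one in MPI_Recv (the one called from remdupslave_). That only restates what the traces show; it does not by itself say which side is waiting on the other.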
