[Beowulf] MPICH-1.2.5 hangs on 16 node cluster
Greg Lindahl lindahl at pathscale.com
Sun Nov 21 10:10:43 PST 2004
- Previous message: [Beowulf] MPICH-1.2.5 hangs on 16 node cluster
- Next message: [Beowulf] torus versus (fat) tree topologies
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Fri, Nov 19, 2004 at 02:37:18PM +0530, Sreenivasulu Pulichintala wrote:

> I see some strange behavior of the MPICH stack when running on a 16 node
> cluster.

Is this stock MPICH? If not, you haven't included very much info about
what you're actually running. In any case:

> On node 2
> --------
> #0 0x0000000041efb877 in poll_rdma_buffer ()
> #1 0x0000000041efd2cb in viutil_spinandwaitcq ()
> #2 0x0000000041efba1e in MPID_DeviceCheck ()
> #3 0x0000000041f0a36b in MPID_RecvComplete ()
> #4 0x0000000041f09ead in MPID_RecvDatatype ()
> #5 0x0000000041f03569 in MPI_Recv ()
> #6 0x0000000041eef42d in mpi_recv_ ()
> #7 0x0000000041c0b153 in remdupslave_ ()
> #8 0x000000000000cf6b in ?? ()
> #9 0x000000000000c087 in ?? ()
> #10 0x000000000002f4b4 in ?? ()
> #11 0x000000000000c503 in ?? ()
> #12 0x000000000000c575 in ?? ()
> #13 0x000000000000040c in ?? ()
> #14 0x00000000401ae313 in dynai_ ()
> #15 0x0000000040006d08 in frame_dummy ()

This process seems to be in a Fortran mpi_recv() call and NOT in
All_Reduce. This could be a programming error in your program. But it
isn't clear whether this stack trace is corrupt.

-- greg

p.s. It would be better if you posted to mailing lists in straight text
instead of text and html.
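
The reading of the backtrace above can be checked mechanically. The following is only a rough sketch in Python, assuming gdb's usual "#N 0xADDR in symbol ()" frame layout; the helper names (parse_frames, user_level_mpi_call, unresolved_frames) are made up for illustration and are not part of gdb or MPICH. It reports the outermost MPI_* frame, which is the call the application itself made, and lists the frames gdb could not resolve, which is one hint that a trace may be damaged.

```python
import re

# Matches gdb frame lines such as "#5 0x0000000041f03569 in MPI_Recv ()".
# Assumption: frames carry an address and an empty-looking argument list,
# as in the trace quoted in this thread.
FRAME_RE = re.compile(r"#(\d+)\s+0x[0-9a-fA-F]+\s+in\s+(\S+)\s+\(")


def parse_frames(backtrace):
    """Return (frame_number, symbol) pairs in the order gdb printed them."""
    return [(int(num), sym) for num, sym in FRAME_RE.findall(backtrace)]


def user_level_mpi_call(backtrace):
    """Return the outermost MPI_* symbol, or None if there is none.

    gdb prints the innermost frame first, so the last MPI_* match is the
    one nearest to application code. A collective implemented on top of
    point-to-point calls would show up here as the collective.
    """
    calls = [sym for _, sym in parse_frames(backtrace) if sym.startswith("MPI_")]
    return calls[-1] if calls else None


def unresolved_frames(backtrace):
    """Return the frame numbers that gdb printed as '??'."""
    return [num for num, sym in parse_frames(backtrace) if sym == "??"]


# The node 2 trace from the quoted message, with the "> " quoting removed.
NODE2 = """
#0 0x0000000041efb877 in poll_rdma_buffer ()
#1 0x0000000041efd2cb in viutil_spinandwaitcq ()
#2 0x0000000041efba1e in MPID_DeviceCheck ()
#3 0x0000000041f0a36b in MPID_RecvComplete ()
#4 0x0000000041f09ead in MPID_RecvDatatype ()
#5 0x0000000041f03569 in MPI_Recv ()
#6 0x0000000041eef42d in mpi_recv_ ()
#7 0x0000000041c0b153 in remdupslave_ ()
#8 0x000000000000cf6b in ?? ()
#9 0x000000000000c087 in ?? ()
#10 0x000000000002f4b4 in ?? ()
#11 0x000000000000c503 in ?? ()
#12 0x000000000000c575 in ?? ()
#13 0x000000000000040c in ?? ()
#14 0x00000000401ae313 in dynai_ ()
#15 0x0000000040006d08 in frame_dummy ()
"""

if __name__ == "__main__":
    print("user-level MPI call:", user_level_mpi_call(NODE2))
    print("unresolved frames:", unresolved_frames(NODE2))
```

Running the same helpers over the traces from the other nodes would show which ranks are sitting in a collective and which are in a point-to-point receive; a mix of the two is the usual signature of the kind of application error suggested above.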
