[Beowulf] Load Balance Shifts During Run of Fixed Balance Application [RESOLVED]
Michael H. Frese Michael.Frese at NumerEx.com
Mon Mar 5 10:38:30 PST 2007
Thanks to those who took the time to consider my original description of our problem. It has now been resolved, and the simulation load balance remains fixed over thousands of time steps.

The problem, not surprisingly, was in our application code, specifically in our use of MPI in one particular place. We had posted some receives on the originating processor (which was also the output processor) for messages that were never sent. We failed to detect the error because, in a second error, we had failed to do a WaitAll on the receive message queue for those messages. The result was that the originating/output processor had an ever-growing receive queue to hunt through while pairing up receives with arriving messages, and so took longer with each successive time step.

We also sent some messages to processors that did not exist, though I think this was less of a problem.

We found the problem by looking for one of a related kind. We built and ran a test code, and found accidentally that failing to post receives caused processors to have to hunt through an increasing queue of received but unprocessed messages.

Thanks again.

Mike
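For readers who want to see why orphaned receives slow one rank down over time, here is a minimal toy model in Python. It is only a sketch and not how any particular MPI library implements message matching: it assumes the posted-receive queue is a simple list searched from the front, and the function name `simulate` and its parameters are hypothetical, chosen for illustration. Each time step posts a few receives that are never matched, then posts and matches the legitimate ones, and counts how many queue entries are inspected.

```python
def simulate(steps, real_msgs_per_step, orphan_recvs_per_step):
    """Toy model of a posted-receive queue searched linearly, oldest first.

    Returns a list with the number of queue entries inspected per time step.
    This is an illustrative assumption, not a real MPI implementation.
    """
    posted = []              # receives posted but not yet matched
    next_orphan = 0          # counter used to label orphan receives
    comparisons_per_step = []

    for step in range(steps):
        # Orphan receives: posted for messages that no rank ever sends,
        # and never waited on, so they stay in the queue forever.
        for _ in range(orphan_recvs_per_step):
            posted.append(("orphan", next_orphan))
            next_orphan += 1

        # Legitimate receives for this time step.
        for m in range(real_msgs_per_step):
            posted.append(("real", step, m))

        # Each arriving message is paired with its receive by scanning
        # the queue from the front until the matching entry is found.
        comparisons = 0
        for m in range(real_msgs_per_step):
            wanted = ("real", step, m)
            for i, entry in enumerate(posted):
                comparisons += 1
                if entry == wanted:
                    del posted[i]
                    break

        comparisons_per_step.append(comparisons)

    return comparisons_per_step


if __name__ == "__main__":
    # Two orphan receives per step: matching cost grows every step.
    print(simulate(4, 3, 2))   # [9, 15, 21, 27]
    # No orphan receives: matching cost stays flat.
    print(simulate(4, 3, 0))   # [3, 3, 3, 3]
```

In this model the work per step grows linearly with the number of accumulated orphan receives, which is consistent with one processor taking progressively longer each time step while the others do not.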
