[Beowulf] Load Balance Shifts During Run of Fixed Balance Application [RESOLVED]

Michael H. Frese Michael.Frese at NumerEx.com
Mon Mar 5 10:38:30 PST 2007


Thanks to those who took the time to consider my original description of 
our problem.  It has now been resolved and the simulation load balance is 
remaining fixed over thousands of time steps.

The problem, not surprisingly, was in our application code, specifically in 
our use of MPI in one particular place.  We had posted some receives on the 
originating processor -- which was also the output processor -- for 
messages that were never sent.  We failed to detect the error because -- in 
another error -- we had failed to do a WaitAll on the receive message queue 
for those messages.  The result was that the originating/output processor
had an ever-growing receive queue to hunt through while pairing up
receives with arriving messages, and so took longer with each
successive timestep.
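
As a rough illustration of that effect, here is a toy Python model of a
posted-receive queue that is searched linearly from its oldest entry.  It
is only a sketch under stated assumptions: it does not use MPI, it does not
claim to reflect how any particular MPI implementation stores or matches
receives, and the function name and parameters are hypothetical.

```python
def matching_cost_per_step(n_steps, stale_per_step, live_per_step):
    """Count queue entries inspected per timestep in a toy matching model.

    Assumption: pending receives sit in one list and every arriving
    message is matched by scanning that list from the oldest entry.
    """
    posted = []      # tags of receives posted but not yet matched
    costs = []
    next_tag = 0
    for _ in range(n_steps):
        # The bug: receives posted for messages that are never sent.
        for _ in range(stale_per_step):
            posted.append(next_tag)
            next_tag += 1
        # Legitimate receives, each matched later in this timestep.
        live_tags = []
        for _ in range(live_per_step):
            posted.append(next_tag)
            live_tags.append(next_tag)
            next_tag += 1
        # Messages arrive; each scan walks past all the stale entries.
        inspected = 0
        for tag in live_tags:
            for i, pending in enumerate(posted):
                inspected += 1
                if pending == tag:
                    del posted[i]
                    break
        costs.append(inspected)
    return costs
```

In this model, matching_cost_per_step(4, 2, 3) returns [9, 15, 21, 27]:
the per-timestep cost grows linearly because the stale receives are never
cleared.  With no stale receives, matching_cost_per_step(4, 0, 3) returns
[3, 3, 3, 3], i.e. the cost stays flat.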

We also sent some messages to processors that did not exist, though I think 
this was less of a problem.

We found the problem by looking for one of a related kind.  We built and ran a
test code, and found accidentally that failing to post receives caused
processors to have to hunt through an increasing queue of received but
unprocessed messages.

Thanks again.


Mike



