[Beowulf] Load Balance Shifts During Run of Fixed Balance Application

Thu Mar 1 10:51:25 PST 2007

We have a parallel problem that shifts its load balance while executing 
even though we are certain that it shouldn't.  The following will describe 
our experience level, our clusters, our application, and the problem.

Our Experience

We are the developers of an MPI parallel application -- a 2-d 
time-dependent multiphysics code -- with all the intimate knowledge of its 
architecture and implementation that implies.  We are presently using the 
Portland Group Fortran and C compilers and MPICH-1 version 1.2.7.  We have 
had success building and using other parallel applications on HPC systems 
and clusters of workstations, though in those cases the physics was 
3-d.  We have plenty of Linux workstation sysadmin experience.

Our House-Built Clusters

We have built a few small, generally heterogeneous clusters of 
workstations around AMD processors, Netgear GA311 NICs, and different 
switches.  We used Red Hat 8 and 9 for our 32-bit processors, and have 
shifted to Fedora for our recent systems, including our few ventures into 
64-bit land.  Some of our nodes have dual processors.  We have not tuned 
the OSs at all, other than to be sure that our NICs have appropriate 
drivers.  Some of our switches give us 80-90% of gigabit speed as measured 
by NetPIPE, over both TCP/IP and MPI, and others give us 30%.  In the case 
described here, the switch is a slower one, but the application's 
performance is determined by latency, since the messages are relatively 
small.  Our only performance tools are the Linux utility top and a stopwatch.

Our Application Architecture and Performance Expectation

During execution, the application takes thousands of steps that each 
advance simulation time.  The processors advance through the different 
physics packages and parts thereof in lock step from one MPI_Waitall to the 
next, with limited amounts of work being done between these synchronization 
points.  We use MPI_Allreduce to compute maximums, minimums, and sums of 
various quantities.

The application uses a domain decomposition that does not change during 
each run.  Each time step is roughly the same amount of work as previous 
ones, though the number of iterations in the implicit solution methods 
changes.  However, all processors are taking the same number of iterations 
in each time step.  Thus we expect that the relative load on a processor 
will remain roughly the same as the relative size of the domain it is 
assigned in the decomposition.  The problem is that it doesn't.
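
To make that expectation concrete, here is a minimal sketch of the model we 
have in mind.  The domain sizes are hypothetical, and the sketch assumes that 
work per step is proportional to domain size and that a processor waiting at 
a synchronization point shows up as idle in top.

```python
def expected_busy_fractions(domain_sizes):
    """Busy fraction per processor in a lock-step run.

    Assumes work per step is proportional to domain size and that every
    processor waits (idle) for the slowest one at each synchronization point.
    """
    slowest = max(domain_sizes)
    return [size / slowest for size in domain_sizes]


# Hypothetical decomposition: the originating processor (index 0) has the
# smallest domain; the other three have equal, larger domains.
sizes = [50, 90, 90, 90]
for rank, frac in enumerate(expected_busy_fractions(sizes)):
    print(f"processor {rank}: expected busy fraction {frac:.2f}")
```

Under this model the fractions depend only on the decomposition, so they 
should not drift as time steps accumulate.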

There is one exception to our expectation: intermittently, after some 
number of time steps or some interval of simulation time, the application 
writes output.  Each processor writes some dump files identified with its 
node number to a problem directory, and a single processor combines those 
files into one while all the other processors wait.  By controlling the 
frequency of the output, we keep the total time lost in this wait 
relatively small.  In addition, every ten cycles, the output processor 
writes a brief summary of the problem state to the terminal output.

One more thing before we get to the problem.  We don't use mpirun; our 
application reads a processor group file and starts the remote processes 
itself.  Thus, there is one processor that is distinguished from the 
others: it was directly invoked from the command line of a shell -- usually 
tcsh, but never mind that religious war.

The Problem

We have observed unexpected and extreme load-balance shifts during both 
two- and four-processor runs.  In the following, our focus will be on the 
four-processor run.  We observe the load balance by monitoring CPU usage on 
each of the processors with separate xterm-invoked tops from a non-cluster 
machine.  Our primary observable is %CPU; as a secondary observable, we 
monitor the wall time interval between successive 10-cycle terminal edits.

The load balance starts out looking like the relative sizes of the domains 
we assigned to the various processors, just as we expect.  The processor on 
which the run was started has the smallest domain to handle, and its %CPU 
is initially around 50%, while the others are around 90%.  After a few 
hundred time steps or so the CPU usage of the processor on which the job 
was started begins to increase and the others begin to fall.  After a 
thousand time steps or so, the CPU usage is nearly 90% for the originating 
process, and less than 20% for the remote processes.  Not surprisingly, the 
wall time between 10-cycle terminal edits goes up by a factor of 4 over the 
same period.  By observation, no other task ever consumes more than a few 
tenths of a percent of the CPU.

The originating processor is the output processor, but only the terminal 
output is happening during this period, and we observe no significant 
change in the CPU usage during the cycles when that output is 
produced.  Top is updating its output every 5 seconds, and in this run our 
application is taking one time step every 2 seconds.  The message count and 
size of the messages imply that two processors are spending about 30% of 
their time as system time on message startup and about a tenth that much 
actually transmitting data.  There are about 6,000 messages sent and 
received in each time step on those processors, though it varies slightly 
from time step to time step.  The other two processors -- one of which is 
the originating processor -- have about half that many messages to send and 
receive, and spend correspondingly less time doing it.
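
The per-message cost implied by these figures can be worked out directly.  
The sketch below uses only the numbers quoted above (2 seconds per step, 
about 6,000 messages, about 30% startup overhead, and a tenth of that for 
transmission); the variable names are ours, and the inputs are rough.

```python
step_time_s = 2.0          # wall time per time step in this run
messages_per_step = 6000   # messages sent and received on the busier processors
startup_fraction = 0.30    # share of time spent on message startup
transmit_fraction = startup_fraction / 10  # "about a tenth that much"

# Seconds per step spent in each activity.
startup_s = step_time_s * startup_fraction
transmit_s = step_time_s * transmit_fraction

# Microseconds per message for each activity.
startup_us_per_msg = startup_s / messages_per_step * 1e6
transmit_us_per_msg = transmit_s / messages_per_step * 1e6

print(f"startup:  {startup_s:.2f} s/step, {startup_us_per_msg:.0f} us/message")
print(f"transmit: {transmit_s:.2f} s/step, {transmit_us_per_msg:.0f} us/message")
```

That works out to roughly 100 microseconds of startup cost and roughly 10 
microseconds of transmission per message on the two busier processors.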

Though we have shuffled the originating processor and the processors in the 
group, the results are always similar.  In one case we ran with four 
identical nodes except that one had Red Hat 8 while the others had Red Hat 
9.  In another case we ran four Red Hat 9 machines with slightly different 
AMD processor speeds (2.08 vs 2.16 GHz).  The 9.0 kernels are 2.4.20, while 
the 8.0 has been upgraded to 2.4.18.

Here is a final bit of data.  To prove that the shift was not determined by 
the state of the problem being simulated, we restarted the simulation from 
a restart dump made by our application when the load had shifted to the 
originating processor.  The load balance immediately after the restart 
again reflected the domain size as it had in the beginning of the 
unrestarted simulation.  After a thousand cycles in the restarted problem, 
the load had shifted back to the originating processor.

Conclusion/Hypothesis

Our tentative conclusion is that either MPICH or the operating system is 
eating an increasing amount of time on the originating processor as the 
number of time steps accumulates.  We suspect the cause is tied to the 
accumulated number of messages transmitted.  It acts like a leak, but of 
CPU time rather than memory.  Top does not show any increase in 
resident set size (RSS) during the run.
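
A first check of this hypothesis might be to split each time step's cost 
into wall, user, and system time on every processor and watch which 
component grows.  The sketch below only illustrates the bookkeeping, in 
Python for brevity: timed_step and fake_step are hypothetical names, and 
fake_step stands in for one real time step.  In the application itself the 
same accounting could presumably be done with getrusage and MPI_Wtime.

```python
import os
import time


def timed_step(step_fn):
    """Run one step and return (wall, user, system) seconds it consumed."""
    t0 = os.times()
    w0 = time.perf_counter()
    step_fn()
    w1 = time.perf_counter()
    t1 = os.times()
    return (w1 - w0, t1.user - t0.user, t1.system - t0.system)


def fake_step():
    # Hypothetical stand-in for one time step of the real application.
    sum(i * i for i in range(100_000))


# Report every 10th step, mirroring the 10-cycle terminal edit.
for step in range(1, 31):
    wall, user, system = timed_step(fake_step)
    if step % 10 == 0:
        print(f"step {step}: wall={wall:.4f}s user={user:.4f}s sys={system:.4f}s")
```

If system time per step climbs on the originating processor while user time 
stays flat, that would point toward the OS or the message layer rather than 
our own physics code.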

Does anyone have any ideas what this behavior might be, how we can test for 
it, and what we can do to fix it?  Thanks in advance for any help.

Mike