[Beowulf] Load Balance Shifts During Run of Fixed Balance
Michael H. Frese
Michael.Frese at NumerEx.com
Thu Mar 1 10:51:25 PST 2007
We have a parallel problem that shifts its load balance while executing
even though we are certain that it shouldn't. The following will describe
our experience level, our clusters, our application, and the problem.
We are the developers of an MPI parallel application -- a 2-d
time-dependent multiphysics code -- with all the intimate knowledge of its
architecture and implementation that implies. We are presently using the
Portland Group Fortran and C compilers and MPICH-1 version 1.2.7. We have
had success building and using other parallel applications on HPC systems
and clusters of workstations, though in those cases the physics was
3-d. We have plenty of Linux workstation sysadmin experience.
Our House-Built Clusters
We have built a few, small, generally heterogeneous clusters of
workstations around AMD processors, Netgear GA311 NICs, and different
switches. We used Red Hat 8 and 9 for our 32-bit processors, and have
shifted to Fedora for our recent systems, including our few ventures into
64-bit land. Some of our nodes have dual processors. We have not tuned
the OSs at all, other than to be sure that our NICs have appropriate
drivers. Some of our switches give us 80-90% of Gigabit speed as measured
by NetPIPE, both TCP/IP and MPI, and others give us only 30%. In the case
described here, the switch is one of the slower ones, but the application's
performance is determined by latency, since the messages are relatively
small. Our only performance tools are the Linux utility top and a stopwatch.
Our Application Architecture and Performance Expectation
During execution, the application takes thousands of steps that each
advance simulation time. The processors advance through the different
physics packages and parts thereof in lock step from one MPI_Waitall to the
next, with limited amounts of work done between the barriers. We use
MPI_Allreduce to compute maximums, minimums, and sums of various quantities.
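In outline, a time step looks something like the following C sketch. Our
code is Fortran and the physics is elided here; the routine layout, loop
bounds, and dt values are stand-ins, not our real code:

/* step.c -- schematic of the per-step MPI pattern, not our real code.
   Build: mpicc step.c -o step */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, step;
    double dt_local, dt_global;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (step = 0; step < 1000; step++) {
        /* The boundary exchange would go here: MPI_Isend/MPI_Irecv
           to the neighbors, completed with MPI_Waitall.  That wait
           is the lock-step point described above. */

        /* Each rank computes a local stable time step (stand-in
           value here) ... */
        dt_local = 1.0e-9 / (rank + 1);

        /* ... and the ranks agree on the global minimum. */
        MPI_Allreduce(&dt_local, &dt_global, 1, MPI_DOUBLE,
                      MPI_MIN, MPI_COMM_WORLD);

        if (rank == 0 && step % 10 == 0)
            printf("step %d: dt = %g\n", step, dt_global);
    }
    MPI_Finalize();
    return 0;
}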
The application uses a domain decomposition that does not change during
a run. Each time step involves roughly the same amount of work as the
previous one, though the number of iterations in the implicit solution
methods varies. However, all processors take the same number of iterations
in each time step. Thus we expect the relative load on a processor to
remain roughly proportional to the relative size of the domain assigned
to it in the decomposition. The problem is that it doesn't.
There is one exception to this expectation: intermittently, after some
number of time steps or some interval of simulation time, the application
produces output. Each processor writes dump files identified with its
node number to a problem directory, and a single processor combines those
files into one while all the other processors wait. By controlling the
frequency of the output, we keep the total time lost in this wait
relatively small. In addition, every ten cycles the output processor
writes a brief summary of the problem state to the terminal.
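Schematically, the dump phase looks like this C sketch; the file names
and the combine step are hypothetical, and the real code does more:

#include <mpi.h>
#include <stdio.h>

/* Dump phase: every rank writes its own piece, then one rank combines
   the pieces while the others wait.  This wait is the output cost we
   keep small by limiting dump frequency. */
void dump_cycle(int rank, int cycle)
{
    char name[64];
    FILE *f;

    snprintf(name, sizeof(name), "dump%04d.node%02d", cycle, rank);
    f = fopen(name, "w");
    /* ... write this rank's portion of the problem state ... */
    fclose(f);

    MPI_Barrier(MPI_COMM_WORLD);   /* all pieces on disk */
    if (rank == 0) {
        /* combine the dump%04d.node* files into a single file */
    }
    MPI_Barrier(MPI_COMM_WORLD);   /* others wait for the combine */
}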
One more thing before we get to the problem. We don't use mpirun; our
application reads a processor group file and starts the remote processes
itself. Thus, there is one processor that is distinguished from the
others: it was directly invoked from the command line of a shell -- usually
tcsh, but never mind that religious war.
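For anyone unfamiliar with that mechanism: under MPICH-1's ch_p4 device,
the directly invoked process reads a procgroup file listing the other
hosts and starts the remote processes itself. Roughly, with made-up host
names and paths (our file may differ in detail):

local 0
node1  1  /home/mhf/bin/ourapp
node2  1  /home/mhf/bin/ourapp
node3  1  /home/mhf/bin/ourapp

Each remote line gives a host, the number of processes to start there,
and the path to the executable.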
The Problem
We have observed unexpected and extreme load-balance shifts during both
two- and four-processor runs. In the following, our focus will be on the
four-processor run. We observe the load balance by monitoring CPU usage on
each of the processors with separate xterm-invoked tops from a non-cluster
machine. Our primary observable is %CPU; as a secondary observable, we
monitor the wall-time interval between the 10-cycle terminal edits.
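A finer-grained observable would be per-rank timers inside the code
itself. We have not added them, but what we have in mind is something
like this C sketch, where the "physics" is just a dummy loop and the
barrier stands in for our MPI_Waitall synchronization point:

/* timers.c -- sketch of per-rank wall-clock accounting with MPI_Wtime.
   Build: mpicc timers.c -o timers */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, step, i;
    double t0, t_comp = 0.0, t_sync = 0.0;
    volatile double x = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (step = 0; step < 100; step++) {
        t0 = MPI_Wtime();
        for (i = 0; i < 1000000; i++)      /* stand-in for the      */
            x += 1.0 / (i + 1);            /* local physics work    */
        t_comp += MPI_Wtime() - t0;

        t0 = MPI_Wtime();
        MPI_Barrier(MPI_COMM_WORLD);       /* stand-in for the      */
        t_sync += MPI_Wtime() - t0;        /* MPI_Waitall sync      */
    }
    printf("rank %d: compute %.2f s, waiting %.2f s\n",
           rank, t_comp, t_sync);
    MPI_Finalize();
    return 0;
}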
The load balance starts out looking like the relative sizes of the domains
we assigned to the various processors, just as we expect. The processor on
which the run was started has the smallest domain to handle, and its %CPU
is initially around 50%, while the others are around 90%. After a few
hundred time steps or so, the CPU usage of the processor on which the job
was started begins to rise while that of the others begins to fall. After a
thousand time steps or so, the CPU usage is nearly 90% for the originating
process and less than 20% for the remote processes. Not surprisingly, the
wall time between 10-cycle terminal edits goes up by a factor of 4 over the
same period. By observation, no other task ever consumes more than a few
tenths of a percent of the CPU.
The originating processor is the output processor, but only the terminal
output is happening during this period, and we observe no significant
change in the CPU usage during the cycles when that output is
produced. Top updates its output every 5 seconds, and in this run our
application takes one time step every 2 seconds. The message count and
size of the messages imply that two processors are spending about 30% of
their time in system time for message startup and about a tenth that much
actually transmitting data. There are about 6,000 messages sent and
received in each time step on those processors, though it varies slightly
from time step to time step. The other two processors -- one of which is
the originating processor -- have about half that many messages to send and
receive, and spend correspondingly less time doing it.
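(To spell out that estimate: assuming a per-message overhead on the order
of 100 microseconds -- our guess for this switch, not a measurement --
6,000 messages cost about 0.6 seconds of a 2-second time step, or roughly
30%; the data volume itself, at these message sizes, accounts for perhaps
a tenth of that.)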
Though we have shuffled the originating processor and the processors in the
group, the results are always similar. In one case we ran with four
identical nodes, except that one had Red Hat 8 while the others had Red Hat
9. In another case we ran four Red Hat 9 machines with slightly different
AMD processor speeds (2.08 vs. 2.16 GHz). The 9.0 kernels are 2.4.20, while
the 8.0 kernel has been upgraded to 2.4.18.
Here is a final bit of data. To show that the shift is not determined by
the state of the problem being simulated, we restarted the simulation from
a restart dump made by our application after the load had shifted to the
originating processor. The load balance immediately after the restart
again reflected the domain sizes, as it had at the beginning of the
original run. After a thousand cycles of the restarted problem, the load
had shifted back to the originating processor.
Our tentative conclusion is that either MPICH or the operating system is
consuming an increasing amount of time on the originating processor as the
time steps accumulate. The accumulated number of messages transmitted is
probably the problem. It acts like a leak, but of CPU time rather than
memory: top shows no increase in resident set size (RSS) during the run.
Does anyone have any ideas about what this behavior might be, how we can
test for it, and what we can do to fix it? Thanks in advance for any help.