[Beowulf] Latency Sensitive Algorithms

Sat May 7 12:52:22 PDT 2005

Hi everyone,

I have been doing some research for the development of a next-generation 
CFD code for multiphase flow.  In the process, I have been investigating 
lots of algorithm options and trying to identify trade-offs.  One of our 
primary tasks is to improve parallel scalability, even for fairly small 
problems.  We can "solve" the scalability problem by running large cases 
or by using high speed interconnects, but that is not really a solution in 
my mind.

As most of you are aware, CFD often comes down to solutions of large,
sparse linear systems.  Most are either Krylov space solvers or
(algebraic) multigrid solvers...or some co-mingling of the two.  After
studying and using solvers in many of the large packages (PETSC, Hypre,
Trilinos), it seems that all of them suffer from latency limitations.  By
that, I mean none provide a mechanism that allows overlap of communication
and computation, at least not on the level of our problem space.  In fact,
by and large, the solver packages don't seem to try to do latency hiding
at all.  I don't think that is due to laziness.  These latency limitations 
are algorithmic.

Krylov methods in all of their forms use matrix/vector and vector/vector
products.  The matrix/vector product can use latency hiding, but the dot
products are barriered by a global sum.  For (relatively) high latency
networks, that is a performance limit that can't be removed.  In fact,
global communication of some sort is probably necessary for any
nonstationary iterative scheme.

In multigrid, relaxation on the lowest levels can take advantage of
latency hiding without mangling the algorithm too badly.  Parallel
multigrid typically uses Jacobi smoothing at parallel interfaces, so
changing the update order to do latency hiding isn't too problematic.  
For the coarser levels, the amount of data to smooth becomes much smaller.  
While the amount of data to be communicated on coarser levels also drops,
the network latency remains fixed.  Eventually, it is better to just
serialize the coarse mesh solution.  The AMG solver in OpenFOAM, for
example, just solves the equations with a Krylov solver once the coarse
mesh reaches a user-specified size.  This coarse grid treatment leads to a
latency-induced parallel scaling problem as well.

Is there any effort underway to build more latency tolerance sparse
equation algorithms?  I know of work done to reduce or clump the dot
product calculations in Krylov methods by reorganizing computations.  But
these schemes are careful to retain the same underlying algorithm (though
they still suffer stability issues).  I haven't seen work done on new
*parallel* algorithms that diverge from the classic serial algorithm.  
The only thing I have seen in this vein are the Schwartz decomposition
techniques commonly used to procondition Newton-Krylov solvers.

Perhaps there is a Krylov method that builds separate "local" spaces and
"global" spaces.  The local spaces use on-processor data allowing for fast
dot-products.  The "global" spaces might use older data that could be 
communicated while other work is being donw.

Or perhaps there is a multigrid scheme that abandons the cycling strategy 
and instead allows all of the grids to be smoothed simultaneously.  That 
would allow the interior calculations on the finer grids to hide the 
communication latency on all of the levels.  

I would think a latency-tolerant parallel linear equation algorithm/solver
would be of great value to many projects that do not run on ASCI hardware.  
Maybe there are other common latency-sensitive algorithms that need 
similary attention.  My CFD background makes me a bit myopic.

So, have I missed something or is this area of research as "dead" as it 
seems?  Have we accepts as writ that sparse linear equation solvers will 
always be suffer latency bottlenecks?

There is a whole parallel discussion about memory bandwidth requirements
of algorithms and what can be done to improve cache reuse, etc.  In some
ways, they are both asking the same question.  Isn't there a better
algorithm to solve sparse linear systems that preserves good convergence
but better fits our available hardware?

Thanks,

Mike