[Beowulf] mem consumption strategy for HPC apps?

Robert G. Brown rgb at phy.duke.edu
Thu Apr 14 06:46:12 PDT 2005


On Thu, 14 Apr 2005, Toon Knapen wrote:

> What is the ideal way to manage memory consumption in HPC applications?

Dear Toon (great name:-):

I don't think that there is any simple answer to your question, because
of the wide variability of hardware upon which one can run and the
numerous boundaries where superlinear/nonlinear speedup or slowdown can
occur.

ATLAS (and other self-adaptive numerical algorithms) is a case in point
-- ATLAS goes through a very complex build/search process to tune just
to the primary memory subsystems on a single host -- registers, L1
cache, L2 cache, and main memory.  There is more than an order of
magnitude of variability in latency and effective BW across all of this
for arbitrary access patterns, and an even greater variability across
architectures and memory types and speeds (e.g. a 64 bit Opteron with
very high speed memory versus a 32 bit Celeron with a small L2 cache and
old/slow memory).

The idea of building ATLAS (or other applications) so that they can tune
from a relatively small set of "benchmark" measurements or system
parameters is a good one, but is currently a research question with no
clear answer or result.
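
To make the "benchmark measurement" idea concrete, here is a minimal C
sketch -- not ATLAS's actual search, just an illustration with made-up
sizes and counts -- that times repeated sweeps over working sets of
increasing size.  The time per element jumps as the working set spills
out of L1, then L2, then into main memory:

/* Toy memory-hierarchy probe: sweep working sets of increasing size and
   report the time per element.  Sizes and sweep counts are arbitrary
   illustration values, not anything ATLAS actually uses. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    /* working sets from 4 KB up to 64 MB, doubling each time */
    for (size_t bytes = 4096; bytes <= 64UL * 1024 * 1024; bytes *= 2) {
        size_t n = bytes / sizeof(double);
        double *a = malloc(bytes);
        if (!a) { perror("malloc"); return 1; }
        for (size_t i = 0; i < n; i++) a[i] = 1.0;

        /* keep total work roughly constant across sizes */
        size_t sweeps = (256UL * 1024 * 1024) / bytes;
        if (sweeps < 4) sweeps = 4;

        double sum = 0.0;
        double t0 = seconds();
        for (size_t s = 0; s < sweeps; s++)
            for (size_t i = 0; i < n; i++)
                sum += a[i];
        double t1 = seconds();

        printf("%10zu bytes: %8.3f ns/element (sum=%g)\n",
               bytes, (t1 - t0) * 1e9 / (double)(sweeps * n), sum);
        free(a);
    }
    return 0;
}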

Note that this is the same problem that you are addressing for HPC
applications; HPC just adds additional layers to the memory hierarchy,
with a range of latency and BW costs to e.g. copy to or read from memory
locations on remote nodes vs the full range of local costs described
above (or in a message passing model, the time required to send and/or
receive blocks of data as messages).  Outside of both hierarchies there
are the costs (in time) of using virtual memory in the form of local
disk, which are generally orders of magnitude slower than the slowest
form of local memory.  I >>think<< that disk-based VM is still an order
of magnitude or two slower than accessing remote system memory
(either/any model) over the network, but networks and disks keep
changing and I'm not so certain of those times any more.
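
If you want your own numbers for the network layer of that extended
hierarchy, the usual measurement is a ping-pong test.  Here is a minimal
MPI sketch (assuming an MPI installation is available; message sizes and
repetition counts are arbitrary) that reports one-way latency and
bandwidth between two ranks, which you can then set against the local
memory numbers above:

/* Toy MPI ping-pong: time round trips of messages of increasing size
   between ranks 0 and 1 and report one-way latency and bandwidth.
   Message sizes and repetition counts are illustrative only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    const int reps = 100;
    for (int bytes = 8; bytes <= 1 << 20; bytes *= 4) {
        char *buf = malloc(bytes);
        memset(buf, 0, bytes);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int r = 0; r < reps; r++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("%8d bytes: %10.2f us one-way, %8.2f MB/s\n",
                   bytes, (t1 - t0) / (2.0 * reps) * 1e6,
                   2.0 * reps * bytes / (t1 - t0) / 1e6);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}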

The issue of partitioning your application is further complicated by
CLUSTER variability -- some clusters have 1 GB/CPU nodes, some have 4
GB/CPU nodes, some have 250 MB/CPU nodes, some (lucky) clusters have the
full 8 GB/CPU you describe.  MMUs differ.  Newer CPUs (especially the
dual cores just starting to emerge) have very different basic memory
architectures, and there is significant variation between Intel and AMD
CPUs at nominally equivalent clock(s) and data path widths.
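
One small defensive measure against all that variability is not to
hard-wire any of those figures into the application, but to ask each
node what it actually has at startup.  A minimal Linux/glibc sketch
(_SC_PHYS_PAGES is a glibc extension; how much headroom to leave for the
OS and its services is still your call):

/* Report the node's physical memory and online CPU count as seen by the
   OS, so partitioning decisions can be made per node instead of being
   hard-coded.  Linux/glibc: _SC_PHYS_PAGES is not strictly POSIX. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long pages    = sysconf(_SC_PHYS_PAGES);
    long pagesize = sysconf(_SC_PAGESIZE);
    long cpus     = sysconf(_SC_NPROCESSORS_ONLN);

    double total_gb = (double)pages * pagesize / (1024.0 * 1024.0 * 1024.0);
    printf("physical memory: %.2f GB across %ld CPUs (%.2f GB/CPU)\n",
           total_gb, cpus, total_gb / cpus);
    return 0;
}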

From the above it should be clear that writing a fully tuned "big memory
footprint" HPC application is something that:

  a) Probably will need to be done for a >>specific<< cluster (just as
ATLAS is (re)built for >>specific<< CPU/memory architectures).  Note
that there is a clear trade-off between specificity (and local
optimization) and portability, which can significantly increase TCO down
the road when
you have to effectively rewrite the application for a new/different
cluster.  Writing something automatically portable to a radically
different architecture is No Mean Trick!  Which is why I continue to
think of ATLAS as a toolset a decade or so ahead of its time...

  b) May involve different ALGORITHMS for distinct cluster
architectures.  ATLAS switches algorithms as well as block sizes at
certain empirically determined breakpoints (and the trick is to match
the BEST blocking and algorithm with the particular memory
configuration, CPU speed, and problem size).  

  c) Will involve different BLOCKING (or partitioning -- the main thing
you address below, I think, but only one aspect of the problem) of the
problem for distinct cluster architectures, with strong dependence on
both network and memory size; see the sketch after this list.  If one is
stuck with 100 BT one will almost certainly end up with a very different
performance surface than one will with e.g. Myrinet, SCI, or IB.  If one
has 1 GB nodes vs 8 GB nodes, ditto.

  d) Is clearly a nontrivial problem to solve.  The solution will
require a set of tedious measurements, a careful study of the problem
itself, and a fair amount of experimentation after you THINK you
understand the various rate dependences until you at least reach a local
maximum in performance.  

  e) In a high dimensional space there is no guarantee that this local
maximum will be a global maximum or even a particularly good local
maximum; to find a "good" local maximum in a complex problem or have a
hope of finding the global maximum, one will have to do enough research
to be pretty sure that you've got at least the optimum algorithms
(algorithm changes tend to be discrete "far jumps" on the performance
surface) and that those algorithms have the right projective variables
so that a "good" solution exists within the space of configurations they
span.  
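
As a concrete (if toy) illustration of the blocking knob in point (c),
here is a cache-blocked matrix multiply in C where the block size NB is
the tunable parameter.  The values of N and NB are purely illustrative,
and on a cluster the same idea recurs one level up when you decide how
to partition the matrices across nodes:

/* Cache-blocked matrix multiply: NB is the tuning knob.  This simple
   sketch assumes N is a multiple of NB; both values are illustrative
   and would be chosen per machine (that is ATLAS's whole job). */
#include <stdio.h>

#define N  1024   /* problem size (illustrative) */
#define NB 64     /* block size -- the thing you tune per machine */

static double A[N][N], B[N][N], C[N][N];

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = 1.0; B[i][j] = 2.0; C[i][j] = 0.0;
        }

    /* loop over blocks so that an NB x NB working set of A, B and C
       stays resident in cache while it is being reused */
    for (int ii = 0; ii < N; ii += NB)
        for (int kk = 0; kk < N; kk += NB)
            for (int jj = 0; jj < N; jj += NB)
                for (int i = ii; i < ii + NB; i++)
                    for (int k = kk; k < kk + NB; k++)
                        for (int j = jj; j < jj + NB; j++)
                            C[i][j] += A[i][k] * B[k][j];

    printf("C[0][0] = %g\n", C[0][0]);
    return 0;
}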

So it MAY not be easy (this is a worst case analysis, but there are
whole classes of problems with decompositions with trivial optima, e.g.
the entire class of embarrassingly parallel applications with no IPCs at
all).  Still, even for complex problems ATLAS's factor of 2-3 improvement
in performance clearly shows us that the pay-off can be several times
the cost of your entire cluster (or years of your life not spent waiting
for results).  Measured in dollars or otherwise, there is probably more
raw value locked up here for at least some problems than in any other
single aspect of cluster performance being worked on today.

One step beyond even this is co-engineering your cluster AND your
application to achieve a REAL global optimum in total cost of ownership
and benefit derived from running the application.  Here one has to
factor in not only the optimized rate of the application built for
various cluster architectures you might want to build/buy, but the cost
of the alternative clusters including all operational and infrastructure
costs, compared to the benefits gained in terms of running things
faster.

Here again there are whole spaces of applications for which there is a
trivial projective decomposition and fairly obvious maxima (CPU-bound
embarrassingly parallel applications with little marginal benefit
associated with finishing incrementally faster and with a fixed budget
to spend according to some schedule).  Then there are the exceptions,
where one really needs to work HARD matching the ideally programmed
application to the ideal hardware in the ideal environment to get a
factor of three greater benefit by whatever metric most matters to you
per dollar spent.

The difficulty of this problem is what has spawned various cluster
consulting firms, such as Joe's Scalable Informatics.  There, people with
a fair bit of experience in cluster engineering and hardware (all
rapidly varying stuff where one has to WORK to stay current) help
customers get at least within spitting distance of an optimal
configuration for various problems in which they have domain expertise
on a contractual/consultative basis.  Obviously with as much as an order
of magnitude difference in price/performance on the table, there is
plenty of marginal benefit to make dealing with such a company
worthwhile for certain classes of problems, depending also on the degree
of local cluster expertise one has to draw on.

It is also one of the factors that drives e.g. Scyld, which provides
streamlined and pre-optimized cluster operating systems.  These systems
can save one money on TCO (simplified installation, management,
maintenance) and create a cluster architecture in which the optimum for
certain classes of problem is likely to occur, so it is relatively easy
to sensibly optimize the application's performance (maybe even across
more than one hardware or network architecture, providing a measure of
portability).

HTH, although it probably won't (since my answer is pretty much that
there is no "answer", only procedures for organizing your thoughts to
search for YOUR particular answer for YOUR particular code on YOUR
particular cluster).

    rgb

> 
> For HPC applications, performance is everything. Next we all know about 
> the famous performance-memory tradeoff which says that performance can 
> be improved by consuming more memory and vice versa. Therefore HPC 
> applications want to consume all available memory.
> 
> But the performance-memory tradeoff as mentioned above supposes infinite 
> memory and infinite memory bandwidth. Because memory is finite, consuming 
> more memory than is physically available will result in swapping by the 
> OS and therefore a big performance hit. And since bandwidth is also 
> finite and latency nonzero, we have caching. But now we also need to be 
> cautious not to lose time due to cache thrashing.
> 
> Knowing this we could say that HPC applications generally want to eat 
> all available memory but not more. All available memory here means all 
> physical memory minus the physical memory consumed by the system and its 
> basic services, because we suppose that HPC applications do not share 
> their processor with other applications (so as to have the whole cache to 
> themselves). Well, this is true for single-processor machines. On 
> multi-processor machines (SMP, NUMA) each application can only consume a 
> part of the physical memory.
> 
> So because the application does not know how much physical memory it is 
> allowed to eat, it might be best that the user just specifies it when 
> launching the application.
> 
> But suppose now a single-proc machine has 8GB physical memory. Taking 
> into account that the OS and its services will never take more than 
> 500MB, the user might say to the HPC application that it can eat up to 
> 7.5GB of the physical memory.
> 
> But what does this number mean to the HPC application that is trying to 
> optimise its performance? Should it try never to consume more than 7.5GB 
> at all, or should it only try never to consume more than 7.5GB in its 
> intensive loops (e.g. in the solver)? In the latter case, can we rely on 
> the OS swapping out the inactive parts of our application to make space 
> for the solver, or would it be better that the application puts all 
> data structures that are not used in the solver on disk to make sure? 
> OTOH, if we want to limit the total memory consumption to 7.5GB, would it 
> be best to allocate a memory pool of 7.5GB and, if the pool is full, abort 
> the application (after running for days)?
> 
> toon

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu




