[Beowulf] mem consumption strategy for HPC apps?

Fri Apr 15 04:32:34 PDT 2005

Robert G. Brown wrote:
> On Thu, 14 Apr 2005, Toon Knapen wrote:
> 
> 
>>What is the ideal way to manage memory consumption in HPC applications?

> 
> I don't think that there is any simple answer to your question, because
> of the wide variability of hardware upon which one can run and the
> numerous boundaries where superlinear/nonlinear speedup or slowdown can
> occur.
> 
> ATLAS (and other self-adaptive numerical algorithms) are a case in point
> -- ATLAS goes through a very complex build/search process to tune just
> to the primary memory subsystems on a single host -- registers, L1
> cache, L2 cache, and main memory.  

We rely on optimised BLAS libraries because they generally offer the 
best available optimisation (by means of blocking) for the different 
cache levels. But ATLAS only looks at main memory and the caches. It does 
not perform out-of-core operations, nor does it need to allocate 
temporary buffers (although some of the LAPACK routines it implements 
might do so). ATLAS thus assumes that there is enough physical memory to 
hold all the data. When working with sparse matrices, this assumption no 
longer holds.
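
To make "blocking" concrete, here is a minimal sketch of a tiled matrix 
multiply. It is not ATLAS's code: the function names and the block size 
of 32 are assumptions for illustration (ATLAS searches for its tile sizes 
empirically), and pure Python will not show the cache benefit. It only 
shows the loop structure.

```python
def matmul_naive(a, b, n):
    """Reference product of two n x n matrices stored as row-major flat lists."""
    c = [0.0] * (n * n)
    for i in range(n):
        for k in range(n):
            aik = a[i * n + k]
            for j in range(n):
                c[i * n + j] += aik * b[k * n + j]
    return c


def matmul_blocked(a, b, n, block=32):
    """Same product, computed tile by tile.

    Each (ii, kk, jj) step touches one block x block tile of A, B and C,
    so the working set of the inner loops stays small enough for a cache.
    The default block size is an arbitrary illustrative value.
    """
    c = [0.0] * (n * n)
    for ii in range(0, n, block):
        i_end = min(ii + block, n)
        for kk in range(0, n, block):
            k_end = min(kk + block, n)
            for jj in range(0, n, block):
                j_end = min(jj + block, n)
                # Multiply the A tile (ii, kk) by the B tile (kk, jj)
                # and accumulate into the C tile (ii, jj).
                for i in range(ii, i_end):
                    for k in range(kk, k_end):
                        aik = a[i * n + k]
                        for j in range(jj, j_end):
                            c[i * n + j] += aik * b[k * n + j]
    return c
```

Note that every tile is still addressed in ordinary memory: nothing here 
moves data to or from disk.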

But let us not consider sparse matrix algorithms for the moment. Instead 
let us look at integrating ATLAS in a Real World Application. ATLAS 
assumes all its data are at least in physical memory. Suppose now that 
my app's footprint is bigger than the available physical memory. 
Whenever I call an ATLAS routine, can I rely on the OS to swap out the 
pages that will not be used by the ATLAS algorithms? Is the OS good at 
deciding which pages to swap out? Or do some of you have experience that 
storing all 'inactive' parts of your application on disk yourself 
improves performance drastically?
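
By storing 'inactive' parts on disk yourself I mean something like the 
following minimal sketch. The class name `SpillableBuffer` and its 
methods are hypothetical, invented for this illustration; a real 
application would have to decide when to spill and would probably use 
larger, aligned I/O.

```python
import os
import tempfile
from array import array


class SpillableBuffer:
    """A block of doubles that can be parked on disk while it is inactive."""

    def __init__(self, values):
        self._data = array("d", values)
        self._count = len(self._data)
        self._path = None

    @property
    def in_memory(self):
        return self._data is not None

    def spill(self):
        """Write the data to a temporary file and drop the in-memory copy."""
        if self._data is None:
            return
        fd, self._path = tempfile.mkstemp(suffix=".spill")
        with os.fdopen(fd, "wb") as f:
            self._data.tofile(f)
        self._data = None

    def load(self):
        """Bring the data back into memory (if spilled) and return it."""
        if self._data is None:
            data = array("d")
            with open(self._path, "rb") as f:
                data.fromfile(f, self._count)
            os.remove(self._path)
            self._path = None
            self._data = data
        return self._data
```

The application would call `spill()` on everything it does not need 
before entering a long BLAS-heavy phase and `load()` afterwards, instead 
of leaving that choice to the OS pager.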

> Note that this is the same problem that you are addressing in HPC
> applications; HPC just adds additional layers to the memory hierarchy,
> with a range of latency and BW costs to e.g. copy to or read from memory
> locations on remote nodes vs the full range of local costs described
> above (or in a message passing model, the time required to send and/or
> receive blocks of data as messages).  Outside of both hierarchies there
> are the costs (in time) of using virtual memory in the form of local
> disk, which are generally orders of magnitude slower than the slowest
> form of local memory.  I >>think<< that disk-based VM is still an order
> of magnitude or two slower than accessing remote system memory
> (either/any model) over the network, but networks and disks keep
> changing and I'm not so certain of those times any more.

Agreed. But let's focus for now on the (virtual) memory handling on a 
one-processor machine because my questions regarding clusters will be 
influenced by the answers for one-processor machines.

> 
> The issue of partitioning your application is further complicated by
> CLUSTER variability -- some clusters have 1 GB/CPU nodes, some have 4
> GB/CPU nodes, some have 250 MB/CPU nodes, some (lucky) clusters have the
> full 8 GB/CPU you describe.  MMUs differ.  Newer CPUs (especially the
> dual cores just starting to emerge) have very different basic memory
> architectures, and there is significant variation between Intel and AMD
> CPUs at nominally equivalent clock(s) and data path widths.
>

Agreed but I'm trying to look at the issue step by step: first for a 
uni-processor machine, next a homogeneous cluster and finally a 
heterogeneous cluster.

> From the above it should be clear that writing a fully tuned "big memory
> footprint" HPC application is something that:
> 
>   a) Probably will need to be done for a >>specific<< cluster (just as
> ATLAS is (re)built for >>specific<< CPU/memory architectures).  Note
> that there is a clear trade-off of specificity (and local optimization)
> and portability which can significantly increase TCO down the road when
> you have to effectively rewrite the application for a new/different
> cluster.  Writing something automatically portable to a radically
> different architecture is No Mean Trick!  Which is why I continue to
> think of ATLAS as a toolset a decade or so ahead of its time...

Considering most time is spent in BLAS operations, one just needs to use 
an optimised BLAS library on each of the different architectures. But 
since BLAS is entirely in-core and all OSes have VM, I wonder how the 
memory footprint of my app will affect the performance of BLAS.
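
A rough way to frame the question is to compare the bytes a single dense 
BLAS call touches, plus the rest of the app, against physical memory. The 
sketch below does that; all function names are hypothetical, the 20% 
reserve for the OS and other processes is an assumed figure, and 
`physical_memory_bytes` relies on POSIX `sysconf` names that are 
available on Linux but not on every platform.

```python
import os


def physical_memory_bytes():
    """Total physical RAM via POSIX sysconf (Linux and similar systems)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def dense_footprint_bytes(m, n, k, itemsize=8):
    """Bytes touched by C = A*B with A (m x k), B (k x n), C (m x n).

    itemsize=8 assumes double precision.
    """
    return itemsize * (m * k + k * n + m * n)


def fits_in_core(footprint_bytes, physical_bytes, reserve_fraction=0.2):
    """True if the footprint fits in RAM after setting aside a reserve.

    reserve_fraction is an assumed allowance for the OS and other
    processes, not a measured value.
    """
    usable = physical_bytes * (1.0 - reserve_fraction)
    return footprint_bytes <= usable
```

For example, `fits_in_core(app_bytes + dense_footprint_bytes(m, n, k), 
physical_memory_bytes())` being false would mean the call can only run 
if the OS pages something out, which is exactly the case I am asking 
about.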

<snip 4 other interesting remarks>

I know, I'm currently focusing on a sequential machine because I want to 
understand the 'best practice' on one machine (but independent of the 
OS/vendor as much as possible) before looking at the (heterogeneous) 
cluster-picture. I hope this is not considered OT for this list (I'm 
sorry if it is).

Thanks for the interesting response already,

toon
