[Beowulf] mem consumption strategy for HPC apps?
Toon Knapen toon.knapen at fft.be
Fri Apr 15 04:32:34 PDT 2005
Robert G. Brown wrote:
> On Thu, 14 Apr 2005, Toon Knapen wrote:
>
>> What is the ideal way to manage memory consumption in HPC applications?
>
> I don't think that there is any simple answer to your question, because
> of the wide variability of hardware upon which one can run and the
> numerous boundaries where superlinear/nonlinear speedup or slowdown can
> occur.
>
> ATLAS (and other self-adaptive numerical algorithms) are a case in point
> -- ATLAS goes through a very complex build/search process to tune just
> to the primary memory subsystems on a single host -- registers, L1
> cache, L2 cache, and main memory.

We rely on optimised BLAS libraries because these generally have the
best possible optimisation (by means of blocking) for the different
cache levels. But ATLAS only looks at main memory and the caches. It
does not perform out-of-core operations, nor does it need to allocate
temporary buffers (although some of the LAPACK routines it implements
might do so). ATLAS thus supposes that there is enough physical memory
to hold all the data.

When working with sparse matrices, this assumption no longer holds. But
let us not consider sparse matrix algorithms for the moment. Instead,
let us look at integrating ATLAS in a Real World Application. ATLAS
supposes all its data are at least in physical memory. Suppose now that
my app's footprint is bigger than the available physical memory.
Whenever I call an ATLAS routine, can I rely on the OS to swap out the
pages that will not be used by the ATLAS algorithms? Is the OS good at
deciding which pages to swap out? Or have some of you found that storing
all 'inactive' parts of your application on disk yourself improves
performance drastically?

> Note that this is the same problem that you are addressing HPC
> applications; HPC just adds additional layers to the memory hierarchy,
> with a range of latency and BW costs to e.g. copy to or read from memory
> locations on remote nodes vs the full range of local costs described
> above (or in a message passing model, the time required to send and/or
> receive blocks of data as messages). Outside of both hierarchies there
> are the costs (in time) of using virtual memory in the form of local
> disk, which are generally orders of magnitude slower than the slowest
> form of local memory. I >>think<< that disk-based VM is still an order
> of magnitude or two slower than accessing remote system memory
> (either/any model) over the network, but networks and disks keep
> changing and I'm not so certain of those times any more.

Agreed. But let's focus for now on the (virtual) memory handling on a
one-processor machine, because my questions regarding clusters will be
influenced by the answers for one-processor machines.

> The issue of partitioning your application is further complicated by
> CLUSTER variability -- some clusters have 1 GB/CPU nodes, some have 4
> GB/CPU nodes, some have 250 MB/CPU nodes, some (lucky) clusters have the
> full 8 GB/CPU you describe. MMUs differ. Newer CPUs (especially the
> dual cores just starting to emerge) have very different basic memory
> architectures, and there is significant variation between Intel and AMD
> CPUs at nominally equivalent clock(s) and data path widths.

Agreed, but I'm trying to look at the issue step by step: first for a
uni-processor machine, next a homogeneous cluster and finally a
heterogeneous cluster.

> From the above it should be clear that writing a fully tuned "big memory
> footprint" HPC application is something that:
>
> a) Probably will need to be done for a >>specific<< cluster (just as
> ATLAS is (re)built for >>specific<< CPU/memory architectures). Note
> that there is a clear trade-off of specificity (and local optimization)
> and portability which can significantly increase TCO down the road when
> you have to effectively rewrite the application for a new/different
> cluster. Writing something automatically portable to a radically
> different architecture is No Mean Trick! Which is why I continue to
> think of ATLAS as a toolset a decade or so ahead of its time...

Considering most time is spent in BLAS operations, one just needs to use
an optimised BLAS library on the different architectures. But since BLAS
is totally in-core and all OSes have VM, I wonder how the memory
footprint of my app will affect the performance of BLAS?

<snip 4 other interesting remarks>

I know, I'm currently focusing on a sequential machine because I want to
understand the 'best practice' on one machine (but independent of the
OS/vendor as much as possible) before looking at the (heterogeneous)
cluster picture.

I hope this is not considered OT for this list (I'm sorry if it is).

Thanks for the interesting response already,

toon

--
Check out our training program on acoustics and register on-line at
http://www.fft.be/?id=35
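
To make the "store inactive parts on disk yourself" option from the questions above concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under stated assumptions, not a recommendation and not anything ATLAS or BLAS provides: the class name SpillableBuffer and its methods are hypothetical, and whether explicit spilling actually beats the OS pager depends on the machine and workload, which is exactly the open question in this thread.

```python
import array
import os
import tempfile


class SpillableBuffer:
    """Hypothetical helper: a vector of doubles held either in RAM or in a file."""

    def __init__(self, values):
        self._data = array.array("d", values)  # in-core copy
        self._length = len(self._data)
        self._path = None                      # set only while spilled to disk

    @property
    def in_core(self):
        return self._data is not None

    def spill(self):
        """Write the data to a temporary file and drop the in-core copy."""
        if self._data is None:
            return  # already on disk
        fd, self._path = tempfile.mkstemp(suffix=".spill")
        with os.fdopen(fd, "wb") as f:
            self._data.tofile(f)
        self._data = None  # lets the interpreter release the memory

    def load(self):
        """Read the data back into memory and delete the spill file."""
        if self._data is not None:
            return self._data
        data = array.array("d")
        with open(self._path, "rb") as f:
            data.fromfile(f, self._length)
        os.remove(self._path)
        self._path = None
        self._data = data
        return data


# Usage: spill a part of the application that the in-core kernel will not touch.
inactive = SpillableBuffer([1.0, 2.0, 3.0])
inactive.spill()
# ... call the in-core kernel (for example a BLAS routine) on other data here ...
restored = inactive.load()
```

The alternative discussed in the message is to do nothing and let the OS decide which pages to swap out; the sketch only shows what the explicit variant might look like so the two can be timed against each other.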
