[Beowulf] mem consumption strategy for HPC apps?

Robert G. Brown rgb at phy.duke.edu
Mon Apr 18 04:13:44 PDT 2005


On Sun, 17 Apr 2005, Toon Knapen wrote:

> Mark Hahn wrote:
> >>What is the ideal way to manage memory consumption in HPC applications?
> > 
> > 
> > run the right jobs on the right machines.
> > 
> 
> But because memory is scarce one needs to have a good memory consumption
> strategy.  And memory _is_ scarce; otherwise out-of-core solvers (like
> those used in NASTRAN, for instance) would not be necessary.
....
> direct solvers treating 1 million dofs (and a decent bandwidth of course)
> need a _lot_ of memory. Thus: out-of-core solvers are necessary.
.... 
> I'm assuming at least 1G also, but most have 4G per node and up.  Again,
> this is for direct solvers of big systems.
....
> The question is just: any out-of-core solver uses blocking and treats the
> problem block by block.  But how big should the blocks ideally be?  Can I
> take a block size that is almost equal to my physical memory, thus relying
> on the rest of the app being swapped out (taking into account that a
> bigger block size improves performance)?
....
> but a typical timeslice is much shorter than 100 seconds. Additionally 
> you're not taking into account the time you lose rebuilding your cache.
....
> OK, thanks. This was one of my main questions. So as you said before: 
> the OS swapping an HPC app is a non-fatal error.

These responses are not inconsistent with Mark's or Greg's remarks.  Let
me reiterate:

  a) The "best" thing to do is to match your job to your node's
capabilities or vice versa so it runs in core when split up.  This is
just ONE of MANY "best practice" or "design constraint" issues to be
faced though, as it is equally important to match up networking
requirements, memory access speeds, and a lot more.  There is some sense
in the HPC community of how to go about doing these things, and
fortunately for MOST cluster owner/builders, it isn't impossibly
difficult to achieve a profitable operating regime (near linear speedup
in number of nodes) if not an optimal one.
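
As a purely illustrative back-of-the-envelope check (every figure below is
an assumption, not a recommendation), something like the following is often
enough to tell you whether a partitioned banded direct solve will stay in
core at a given node count:

    # Rough in-core fit estimate for a banded double precision factor.
    # All figures (dofs, bandwidth, RAM, reserve) are made-up assumptions.
    n_dof     = 1000000          # unknowns in the global system
    bandwidth = 5000             # assumed matrix bandwidth
    node_ram  = 4 * 2**30        # 4 GB of RAM per node
    reserve   = 512 * 2**20      # headroom for the OS, buffers, the app

    def gib_per_node(nodes):
        return 8.0 * n_dof * bandwidth / nodes / 2**30

    for nodes in (8, 16, 32, 64):
        need = gib_per_node(nodes)
        fits = need * 2**30 < node_ram - reserve
        print("%3d nodes: %5.1f GiB/node  %s"
              % (nodes, need, "fits in core" if fits else "out of core"))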

  b) IF your job is TOO BIG to fit in core on the nodes, then there IS
NO "BEST PRACTICE".  There is only a choice.  Either:

   Scale your job back so it fits in core.  Seriously.  Most of us who
run things that COULD fill a universe of RAM recognize that we just
plain have to live in the real world and run things at a size that fits.
Fortunately we live in a Moore's Law inflationary universe, so one can
gradually do more.

or

   Bite the bullet and do all the very hard work required to make your
job run efficiently with whatever hardware you have at the scale you
desire.  As you obviously recognize, this is not easy and involves
knowing lots of things about your system.  Ultimately it comes down to
partitioning your job so that it runs IN core again, whether it does so
by using the disk directly, by letting the VM subsystem manage the in
core/out of core swapping and paging for you, or by splitting the job up
across N nodes (a common enough solution) so that its partitioned pieces
fit in core on each node, relying on IPCs instead of memory reads from
disk.
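
To make the first of those options concrete, here is a minimal
block-at-a-time sketch.  It is an illustration only (not NASTRAN's
algorithm or anything like a real solver), and it assumes Python with
numpy and a made-up file name.  Note that the block size is chosen to
leave headroom for the rest of the application rather than counting on
the OS to swap the rest out:

    # Out-of-core matrix-vector product, one block of rows at a time.
    # Sizes and the backing file name are assumptions for illustration.
    import numpy as np

    N     = 20000                # global dense problem size
    BLOCK = 2000                 # rows per block: pick it so BLOCK*N*8
                                 # bytes PLUS the rest of the application
                                 # still fit in physical memory
    A = np.memmap("bigmatrix.dat", dtype=np.float64, mode="w+", shape=(N, N))
    x = np.ones(N)
    y = np.zeros(N)

    for i in range(0, N, BLOCK):
        # only this slice of A needs to be resident at any moment
        y[i:i+BLOCK] = A[i:i+BLOCK, :] @ x

    del A                        # flush and unmap the backing file

The same partitioning logic is what an MPI decomposition buys you, except
that the "blocks" live in other nodes' memory and the disk reads become
messages.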

Really that's it.  And in the second of these cases, although people may
be able to help you with SPECIFIC issues you encounter, it is pointless
to discuss the "general case" because there ain't no such animal.
Solutions for your problem are likely very different from a solution for
somebody partitioning a lattice, which in turn might be different from a
solution for somebody partitioning a large volume filled with varying
numbers and positions of particles.  An efficient solution is likely
going to be expensive and require learning a lot about the operating
system, the various parallel programming paradigms and available support
libraries, and the compute, memory and network hardware, and it may
require some nonlinear thinking, as there exist projects such as Trapeze
to consider (using other nodes' RAM as a virtual extension of your own,
paying the ~5 us latency hit for a modern network plus a memory access
hit, but saving BIG time over a ~ms scale disk latency hit).
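
To put rough numbers on that last trade-off (every figure below is an
assumption in the spirit of the ~5 us and ~ms numbers above, not a
benchmark of any particular hardware):

    # Per-access cost of faulting one 4 KiB page from a remote node's RAM
    # versus from a local disk.  All latency/bandwidth figures are assumed.
    page     = 4096          # bytes per access
    net_lat  = 5e-6          # ~5 us network latency
    net_bw   = 100e6         # ~100 MB/s network bandwidth
    disk_lat = 8e-3          # ~8 ms average seek
    disk_bw  = 50e6          # ~50 MB/s streaming read

    t_remote = net_lat + page / net_bw
    t_disk   = disk_lat + page / disk_bw
    print("remote RAM: ~%.0f us/page   disk: ~%.1f ms/page"
          % (t_remote * 1e6, t_disk * 1e3))

A couple of orders of magnitude per access is the kind of margin that can
make that sort of nonlinear thinking pay for itself.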

Anyway, good luck.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu




