[Beowulf] mem consumption strategy for HPC apps?
Mark Hahn hahn at physics.mcmaster.ca
Fri Apr 15 12:16:16 PDT 2005
> What is the ideal way to manage memory consumption in HPC applications?
run the right jobs on the right machines.

> For HPC applications, performance is everything. Next we all know about
> the famous performance-memory tradeoff which says that performance can
> be improved by consuming more memory and vice versa. Therefore HPC
> applications want to consume all available memory.
this is simply untrue. the HPC apps I deal with most fall into two
broad categories:

- montecarlo-type stuff, which tends to be incredibly small, even
  cache-resident, and which certainly does NOT want or need more memory.
- physically-based simulations (cosmology, condensed matter, materials,
  etc), which have a very clear memory requirement based on the model
  being simulated.

MC-class stuff is almost insignificant in memory use and scales linearly.
physically-based simulations tend to be limited by work, not space - yes,
you need enough memory, but world-class problems will be pushing the
boundaries of cpu/interconnect speed, not so much memory size.

it's useful to state up-front what sizes we're talking about. I say that
1GB per cpu is the lower bound of what's reasonable today - just in terms
of packaging (2x512M dimms per cpu). admittedly, Intel's approach might
get by with less. but my biggest users are reasonably happy with 4GB/p
today, maybe a little eager for 8GB/p tomorrow. that's reasonably
congruent with where the market is (4x1G dimms per opteron, for instance.)
MC people with very simple models are around 4MB/p (that's mega), and
people with larger models (MC or not) are at around 400 MB/p.

> But the performance-memory tradeoff as mentioned above supposes infinite
> memory and infinite memory bandwidth. Because memory is finite, consuming
> more memory than is physically available will result in swapping by the OS
swapping in HPC is a non-fatal error condition.

note that I also didn't mention "old-fashioned" applications which use
tons of disk space to cache partial results.
I would argue that the "old-fashioned" jeer is at least partially
justified by the assumption of disk-intensive apps that they can't scale
NCPUS. yes, you can dress these apps up as "out-of-core", but I'm not so
sure they really make sense today.

> Knowing this we could say that HPC applications generally want to eat
> all available memory but not more. All available memory here means all
I don't believe this is a useful generalization. applications have a
"natural" size. applications which are blindly scaled up in data (without
corresponding scaling of NCPUS) are highly suspect, IMO.

> basic services because we suppose that HPC applications do not share
> their processor with other applications (to have the whole cache for
> itself).
don't freak out about caches! a cache flush is only a millisecond or so,
which means that it's entirely reasonable to timeslice applications on a
fairly coarse granularity (theoretically, even just a few seconds would
be enough to amortize the flush.) admittedly, I'm ignoring the difference
between a cache's worth of isolated fetches and a (streaming) flush, but
back-of-envelope numbers indicate this is not a big problem. the real
issue is that tight-coupled apps need gang scheduling.

> Well this is true for single-processor machines. On multi-proc
> machines (smp,numa) only a part of the physical memory can be consumed.
depends on the machine. even on an altix with 9M caches, flushing would
take ~2ms, so if you do it every 100 seconds, no one will notice.
consider an opteron with 1M cache and 3 GB/s easily sustainable BW and
50 ns latency: that's 0.3 ms for a streaming flush and 0.8 ms line-by-line.

> loops (e.g. in the solver)? In the latter case, can we rely on the OS
> swapping out the inactive parts of our application to make space for the
> solver or would it be better that the application puts all
> data-structures that are not used in the solver on disk to make sure?
the OS has mediocre insight into which pages you really should have
resident vs on disk. you probably *can* arrange your access patterns to
be very simple (a single window moving through a large span of memory)
which the VM will approximate. but you're probably better off doing it
yourself. I question whether this approach makes that much sense, though,
since CPUs are *not* all that expensive, relative to large amounts of ram,
and especially when considering the speed of disk.

> OTOH if we want to limit the total memory consumption to 7.5GB, would it
> be best to allocate a memory-pool of 7.5GB and if the pool is full abort
> the application (after running for days)?
depends on how effectively the VM can approximate the working set. I say
kill the application when its %CPU drops too low due to thrashing. and go
talk to the job's owner to figure out a better way/place to run it.

regards, mark hahn.
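
As a check on the cache-flush figures given earlier in this message, here is a minimal Python sketch that redoes the back-of-envelope arithmetic for the opteron example (1M cache, 3 GB/s sustainable bandwidth, 50 ns latency). The 64-byte cache line size, the reading of "1M" as 1024*1024 bytes, and all function names are assumptions made for illustration; none of them are stated in the message.

```python
# Back-of-envelope cache flush cost, using the figures quoted in the message.
# ASSUMPTIONS (not in the original post): 64-byte cache lines, "1M" read as
# 1024*1024 bytes, and "3 GB/s" read as 3e9 bytes per second.

CACHE_LINE_BYTES = 64  # assumed line size, hypothetical parameter


def streaming_flush_seconds(cache_bytes, bandwidth_bytes_per_s):
    """Time to move one cache's worth of data at sustained bandwidth."""
    return cache_bytes / bandwidth_bytes_per_s


def line_by_line_flush_seconds(cache_bytes, latency_s,
                               line_bytes=CACHE_LINE_BYTES):
    """Time to refill the cache as isolated fetches, one latency per line."""
    return (cache_bytes // line_bytes) * latency_s


def flush_overhead_fraction(flush_s, timeslice_s):
    """Fraction of a timeslice consumed by one flush."""
    return flush_s / timeslice_s


opteron_cache = 1 * 1024 * 1024  # 1M cache
bandwidth = 3e9                  # 3 GB/s sustainable
latency = 50e-9                  # 50 ns per isolated fetch

stream = streaming_flush_seconds(opteron_cache, bandwidth)
isolated = line_by_line_flush_seconds(opteron_cache, latency)

print(f"streaming flush:    {stream * 1e3:.2f} ms")
print(f"line-by-line flush: {isolated * 1e3:.2f} ms")
print(f"overhead with a 1 s timeslice: "
      f"{flush_overhead_fraction(isolated, 1.0):.4%}")
```

Under these assumptions the sketch gives roughly 0.35 ms for a streaming flush and 0.82 ms line-by-line, in line with the rounded .3 ms and .8 ms figures above. Even the slower case costs less than 0.1% of a one-second timeslice, which is the sense in which a few seconds is enough to amortize the flush.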
