[Beowulf] mem consumption strategy for HPC apps?
Many of your questions may have already been answered in earlier discussions or in the FAQ. The search results page will indicate current discussions as well as past list serves, articles, and papers.
Robert G. Brown rgb at phy.duke.eduMon Apr 18 07:54:38 PDT 2005
- Previous message: [Beowulf] mem consumption strategy for HPC apps?
- Next message: [Beowulf] mem consumption strategy for HPC apps?
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Mon, 18 Apr 2005, Toon Knapen wrote: > It's true that there is no 'general case' both OTOH all developers of > HPC applications may learn a lot of each other. It's a pitty there is > little discussion on HPC application and/or library design, which is of > course OT for the beowulf list (it's just a general remark), except for > a few hot topics (such as beowulf). It's not off-topic, its perfectly on topic. Its just that the GENERAL discussion is too broad to be useful to most people. However, everybody benefits from specific, focussed discussions on this very topic and they occur fairly often if you look at the list archives. I think what people are saying is that "what do you do if your application is so big that it runs out of main memory on a HPC compute cluster" is so broad that it cannot be answered (although I gave it a very general shot in my first response:-). Or rather, the answer is trim it down to fit or prepare to work pretty hard on THIS task as there is no simple answer. However, if you ask instead something like "I'm thinking of splitting up task X that is too big to fit in core on my system to run in parallel on a cluster; is this a good idea or should I rely on VM and swap instead?" then you might get some very focussed and useful responses that suggest things like comparing the latency and bw hits associated with disk, the extra overhead associated with the OS determining what chunks of data need to be swapped in and out, and the IPC costs involved in accessing the same data over a network. You might also get folks that ask you if your TASK can be split up or if only its DATA can be split up -- most people who use clusters use them to split up the actual work not just the data (although there can be benefit from either or both). This would then lead to an entirely constructive discussion, totally on topic, of parallel scaling, how best to measure/estimate the IPC cost of various partitionings and possible network hardware and balance them against your CPU/memory speeds and capacities. This in turn would ultimately help you parallelize your application AND engineer a cluster on which a parallelization would yield linear speedup for at least some range of sizes. So to get useful help, ask specific questions about your specific application. It may be that nobody on-list can help you, if nobody is doing that particular thing or anything "like" that. However, there are lots of things people can/will do to help if you then get very specific about your application's operational cycle, as well as generic resources you can read online on parallel programming and cluster computing (notably Ian Foster's online book on the former and my own online book on the latter, both linked to e.g. http://www.phy.duke.edu/brahma and/or my personal home page). To give you a single example of how things work, let me outline the way a person doing some sort of lattice problem might proceed. For whatever reason, they want to do a lattice of 1024^3 sites. On each site there is a struct containing eight double precision (8 byte) variables. Their primary lattice thus occupies 64 GB of space (plus sundry change for the code itself, operational variables and the like), and they have a cluster of 64 2 GB nodes on which to run it. SO, they say "gee, if I put 1 GB blocks of the lattice on each node, it fits". A bit of arithmetic convinces one that this is blocks 256 to the side (since 4^3 = 64). Now, in order to advance the computation each site needs to communicate with its nearest neighbors. For all the interior sites this is a local memory hit. For all the boundary sites (on all six faces of the partitioned cube) this involves two-way IPCs with the nodes that contains the neighboring cubes. On each face there are 256^2 = 64K sites, eight bytes each is 512 KB, six faces = 3 MB of IPC's per node per step. The COST of this (in time) on (say) 100 BT is at least 100 usec per face to start talking to a node plust 0.5/10 MB/sec = 0.05 seconds, bandwidth dominated. To do all six faces requires 0.3 seconds. If you organize your code carefully you can probably find a pattern that does all of the bidirectional communications in parallel for all nodes in the SAME 0.3 seconds. The other thing to compute is how long it takes to "take a step" on a node, that is, do all the computations required on the GB of data local to the node. If we assume that the site update computation is complicated and involves e.g. evaluating a transcendental function per double precision node variable, chances are good that site updates per node will take order of seconds -- the longer the better from the point of view of parallel scaling. OTOH, it might be that each site update takes only a very small time so that updating an entire node takes only order of one second. Even in the latter case we do 64 1 GB units of work in 1.3 seconds compared to 64 seconds on a big memory machine for near linear speedup. The scaling of the computational burden with memory size and the ratio of surface (for IPCs) to volume (for the computation) gives us near linear speedup until the cost of IPCs is GREATER than the cost of updating each node in parallel. Paradoxically, making the computation smaller here is what shifts it into a poorer scaling regime -- a lattice that was only 10x10x10 would never scale out to 64 nodes because the communications latency alone would likely exceed the computational time per node. It is interesting to contemplate what happens if each site needs to communicate with ALL other sites in the computation. In that case updating any site has to loop over ALL other sites, or basically 64 GB worth of data. In this case a lattice decomposition scales very differently -- each node does a "fast" loop over the local sites and then starts a systematic exchange of data with other sites. Node memory requirements may double as well, as one may have to buffer the initial state of each node's lattice to communicate to all the other nodes, plus provide a buffer for incoming node data. IPCs now involve 32 pairs of 1 GB exchanges or an irrelevant latency hit, 1 GB/10 MB/sec = 100 seconds per pair, 3200 seconds (or almost an hour) per update of all sites. Unless I'm making egregious arithmetic errors again (very possible as longtime list humans know:-). This can STILL be very profitable for several reasons. One is that doing the full 64 GB computation still takes a very long time on our mythical big memory machine. One is now doing a double loop and computation time now scales like the number of nodes squared. The mythical machine probably needs 128+ GB of core just to hold the original and updated data, which is even more outrageous than 64 3-4 GB nodes. All of which favor decomposition. Another is that even if the computation required to update a site is SMALL so that the computation takes 64 x longer or even worse to complete than it would on a mythical 128 GB machines, this is INFINITELY FASTER than it would take to complete on a MYTHICAL 128 GB machine as there is no such beast. History is replete with examples of computations that were performed slowly and tediously because they were valuable enough to justify it even if they didn't "fit" into the "comfortable" regime. Newton's hand computations of orbits, Kepler's long and tedious work with Brahe's observations for example. You may lose horribly on parallel scaling but be able to complete the computation, and almost certainly can complete the computation faster than it would complete swapping even if the code runs fully serially on a single system and the network is used only as a big virtual memory. Finally, we're estimating using 100 BT ethernet, which is an old pokey network. We have more than an order of magnitude improvement to play with at the IPC level, and maybe have more than one PARADIGM to play with at the IPC level, if we change networks. We we could conceivably drop the overall IPC hit to 320 seconds (five minutes per site update). Anyway, this is just an example of the kind of thing the list OFTEN discusses -- if you want to see what people know about your particular problem (which may be nothing:-) then describe your particular problem in considerable detail. To get the most help, do the preliminary estimation "like" that above on your own -- your computation needs to handle N entities, updating each of these entities depends on a subset of M entities distributed in a local/nonlocal pattern, to conduct an "independent" step of the computation based on an instantaneous snapshot of the N entities requires T time per entity, and so on. THEN one can look for partitionings of the data entities with favorable ratios of computation to IPCs, or can at the very least estimate how long a parallelized version will take to run for various cluster architectures and partitionings and scales. Then you can make rational cost/benefit decisions, which is the best anyone can ever do -- maybe your problem "can" be done profitably at the scale you want to run it at, maybe it can't, but you'll fully understand either answer. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu
- Previous message: [Beowulf] mem consumption strategy for HPC apps?
- Next message: [Beowulf] mem consumption strategy for HPC apps?
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Beowulf mailing list
