Beowulf: A theorical approach
Many of your questions may have already been answered in earlier discussions or in the FAQ. The search results page will indicate current discussions as well as past list serves, articles, and papers.
Robert G. Brown rgb at phy.duke.eduThu Jun 22 09:20:19 PDT 2000
- Previous message: Beowulfs can compete with Supercomputers [was Beowulf: A theorical approach]
- Next message: beowulf apps for bioinformatics
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Thu, 22 Jun 2000, Nacho Ruiz wrote: > Hi, > > I'm doing my final year project and I'm writting about the Beowulf project > an the Beowulf clusters. > I've been reading several documents about the beowulf clusters, but I would > like to ask all of you some questions about them. > > As I've seen the main objective behind any Beowulf cluster is the > price/performance tag, specially when compared to supercomputers. But as > network hardware and commodity systems are becoming faster and faster > (getting closer to GHz and Gigabit speeds), could you think on competting > directly with supercomputers? They already do, in many arenas. Greg Lindahl has given some wonderful talks where he has showed alpha/myrinet beowulves that outperformed a number of the big iron systems. His company (and some other turnkey beowulf companies) are winning big contracts because their systems aren't just cheaper, they are also in many cases faster AND cheaper. > As I see it the Beowulf cluster idea could be based in the distributed > computign and the parallel computing: you put more CPUs to get more speedup, > but as you can't have all the CPUs in the same machine you use several. So > the Beowulf cluster could fit in between the distributed computing and the > supercomputers (vetorial computers, parallel computers,..etc). You have > advantages from both sides: parallel programming and high scalability; but > you also have several drawbacks: mainly interconection problems. Do you > think that with 10 Gb conections (OC-192 bandwith), SMP in chip (Power 4) > and massive primary and secondary memory devices at low cost, you could > have a chance to beat most of the traditional supercomputers? or is not your > "goal"? You should skim over some of the talks and documents on http://www.phy.duke.edu/brahma (near the top). In particular, read the sections that describe the scaling of parallel computer performance (Amdahl's Law and various improved estimates thereof). Then meditate upon the fact that many of the big iron supercomputers have what amounts to a beowulf architecture, even if they are SMP in that all the processors reside in a single box (that is, they still use a de facto "network" to communicate). It might help to read Pfister's book "In search of clusters", especially his discussion of (CC-)NUMA to see the primary alternative, where shared memory is used to communicate between processors/tasks. It would be worthwhile for you to check out the Trapeze project as well, which seeks to extend virtual memory over a "beowulf-style" network architecture, exploiting the fact that a network connection, however slow, is still three orders of magnitude faster than disk access (so swapping or paging to EVEN an NFS-mounted RAM disk on a second system might well be considerably faster than swapping or paging to a real disk in the same cabinet -- emphasis because NFS is not a particular fast or efficient protocol). Finally, recognize that there is no "goal" shared by all the people on this list other than getting our work done. We are process oriented, not goal oriented;-). For nearly everybody on the list, beowulfs ALREADY "beat" traditional supercomputers in one or more critical dimensions of the highly multidimensional cost/benefit decision all humans with work to do face when trying to decide how to go about doing it. If my goal is to surf the web, I don't buy a Cray, I buy a PC. If my goal is to complete as many independent Monte Carlo simulations as possible per unit time per dollar, I ALSO don't buy a Cray, I buy a network of PC's organized into a "beowulf". If my goal is to solve CFD problems, weather problems, cosmology problems (closer to the "grand challenge" level) then maybe I buy a Cray (or whatever) or maybe I build or buy a beowulf of advanced and competitive design (which is likely to cost a lot more than a simple network of PC's, but a lot less than the Cray). Even the required time to problem completion matters -- if I >>must<< solve the problem as rapidly as possible whatever the cost I'll probably pick a different system than I'd pick if my goal is to solve the problem as rapidly as possible given a fixed budget of X. Driven by a mix of the overwhelming cost-benefit advantage of beowulf/cluster supercomputing for MANY problems and the fact that playing with this in the open source world is damn good and interesting and useful computer science (Bell prizes have been awarded for contributions in beowulfery) beowulfs have evolved into the supercomputer architecture of choice for thousands of sites, many of them "tiny" by comparison with the big sites. I have a five CPU beowulf in my >>home<< office (might get it up to eight this year, with luck and another $1.5K or so invested). A joke by the standards of a T3x or SPx, but even so, I get excellent performance on some problems I'm interested in AND it gives me a very convenient laboratory for my amateurish forays into computer science. Lots of physics, or chemistry, or computer science, or engineering departments in universities have built small 16-32 node beowulfs. This needs to be compared to the far more serious laboratories run by Don Becker, Erik Hendriks, et. al. at Scyld.com, Greg Lindahl's systems, the systems run by e.g. Walter Ligon and Rob Ross at Clemson, the thousand node system doing genetic code development (at Stanford?) and more, where they focus on the underlying computer science and development of advanced infrastructure and software support. Between the high road (interesting computer science that might enable interesting problems to be tackled one day) and the low road (USEFUL results from the interesting computer science being applied to solve real world problems right now) beowulfery has a "mind of its own" and a unique pattern of evolution. Really, the way the beowulf "movement" has proceeded is a fascinating concept and well worth writing about, but don't start off by ascribing to it a fixed goal. It is an amalgam, a hodgepodge, it is like damascus steel with soft grains of iron intermixed with hard grains of high carbon cementation producing something amazingly tough >>and<< flexible. Its evolution pattern involves the open exchange of ideas, the hard nosed acceptance or rejection of those ideas on the basis of the economic benefit that derives from them, the alteration of bad ideas into good ones and good ones into better ones as clever idea are batter around in between smart people. Ahhh, but I digress. Or is it regress. In a minute I'll be spouting poetry, "Ode to the Beowulf...";-) > And about the evolution of the Beowulf clusters, do you all follow a kind of > guideness or the project have divided in several flavors and objectives? As I said, there isn't really a "project". There isn't even a "consortium" or "IEEE standards committee". There is only a website (really a LOT of websites) and an informal, unmoderated list that anyone can join (and leave again if it doesn't suit them). Well, it is moderated by common consensus and occasional flames -- anybody who gets egregiously off-topic or out of line gets razzed or roasted or both. Or worse, just ignored. There are lots of places (to the list and on websites) where successful solutions are posted and documented, and more are being developed every day. People participate because they want to, because they have a >>use<< for the idea. The process is stabilized by those folks that have participated for a long time and by now are invested in beowulf-oriented development as part of their research, their business plan, their professional focus, or just their hobby. There are a number of folks on the list who have been doing one sort of distributed parallel computing or another for a LONG time (which in this business is anything over five years:-) including some of the original inventors of the term "beowulf". The only two flavors that have emerged on the list that are worth mentioning are the "true beowulf" flavor, which is by definition a collection of COTS computers interconnected by a COTS (and usually private) network with (usually) a single "head". The idea is that a "beowulf" is a supercomputer assembled out of COTS parts, and so it can be given a single "name", usually the hostname of its head node, and the internal nodes are viewed as "parts of the supercomputer" and are not generally accessed or utilized as separate compute entities. However, many (possibly even most, I don't know) of the folks on the list are interested in or use more general "clusters" of computers -- things that have been dubbed NOWs, COWs, POPs -- or hybrids, where there is a cluster that is architected much like a beowulf and used for relatively fine grained synchronous tasks but it is part of and node-accessible from a larger network/cluster that might be viewed as a NOW/COW/POP or just a plain old LAN where coarse grain or embarrassingly parallel tasks can be farmed out and harvested. Both groups have similar problems to solve and share a common language of IPC's, latencies, bandwidths, communication patterns, and so forth. Tools developed for use in one context can often be used in the other. I'd say that the computer scientists tend to focus on the "true beowulf" side of things (as that is where the more interesting and tougher problems are) and the stuff they develop filters down into the more general cluster world as appropriate to a task. These two groups generally coexist amicably on the list, provided that one doesn't try to call a sloppy old compute cluster (or anything running WinNT or Win2k) a "beowulf";-). > Are the objectives of the beggining the same as today or now you plan to > have something like a "super SMP computer" in a distributed way (with good > communications times). I've seen that a lot of you are focusing in the GPID > and whole machine idea, do you think that is reachable? What are the main > objectives vs the MPI/PVM message passing idea? > And what about shared memory (in the HD level or the RAM level), do you take > advantage of having this amount of resouces? I could write a book to answer all these questions (the ones that in fact can be answered). Hopefully the discussion above indicates that many of them don't have answers (or have obvious/silly answers -- of course the objectives change with time as the evolution of hardware and software opens up new possibilities). There are groups doing work on most of the things you mention. Many of those groups are not directly engaged with "beowulfery" and may not even have a member on the list, but that doesn't stop the work they do from percolating in, as long as it is "open". "Open" is as much a part of the definition of the beowulf as "COTS". > Is this idea trying to reach the objective of making parallel programs > "independent" to the programmer? I mean, that instead of having to program > having in mind that you are using a parallel machine you can program in a > "normal" way and the compiler will divide/distribute the code over the > cluster. Is this reachable or just a dream? Is somebody working on this? I personally don't even think that it is a dream at this point -- it is a fantasy. The space of possible program parallelizations and reorganizations is a >>very, very complex one<<. Sure, one can write parallel libraries that will autoparallelize some simple "atomic" operations (like multiplying two vectors or doing a sort). However, to >>optimally<< parallelize (or even execute as a single threaded task) even something this "simple", one's compiler/library simply has to know all sorts of things about the parallel computer on which they are to be run. How fast are the IPC's? Where are the L1 and L2 cache boundaries? How fast is memory? What's the cost of a context switch? This problem is difficult enough on an SMP system, where many of the answers are homogeneous and can in principle be made known to a parallel compiler -- in a beowulf, where there are no guarantees of homogeneity or even a common standard of design it is nearly "impossible". Nevertheless, I wouldn't say that nobody is working on this. The problem is that they are working on the tools that will enable the tools that will enable the tools to be built that MIGHT one day permit at least the efficient and automatic parallelization of a small subset of the standard operations one might wish to perform -- e.g. matrix operations and sorts and the like. Even then I'd guess that the programmer will have to be aware that they are programming for parallel operation as there are issues like data organization and separability that I doubt any compiler can manage. Consequently I think that the day will never come when one can take, for example, the off-the shelf source code to, say, netscape or ls or sort, and compile it with the -parallel flag and produce a binary that will automatically parallelize itself at all, let alone efficiently. The compiler will also be utterly unable to make the most important decision of all -- that it is STUPID to parallelize netscape or ls, STUPID to run sort in parallel for small problems, but that it MIGHT be smart to parallelize sort for problems bigger than some systems and network speed dependent threshold! > And what about the administration of a cluster. Having all the machine of > the cluster under control, so you can know which are avaliable to send some > work, is an hazarous task but necessary. Is not as easy as in a SMP machine > where you know or assume that all the CPUs inside are working, in a cluster > you can't do that as the CPU might work but the HD, NIC or memory may fail. > How much computational time do you spend in this task? There's somebody > working in a better way to manage with this? > > I know that sometime ago HP had a machine woth several faulty processors > working and achiving high computational speeds without any error. They used > some kind of "control algorithm" that manages to use only the good CPUs. Do > you have something like this or there is no point? Does it make sense? There are plenty of people working on adminstrative tools, and a number of tools already exist to solve many of the problems you mention (although the tools may still need work). Fault tolerance is a whole different question. There are certainly folks interested in the problem and are very likely people working on it, but fault tolerance is >>also<< a cost-benefit problem and it has two general KINDS of solutions. One is to engineer the tolerance into the underlying systems architecture. Dual power supplies. RAID 5 run by dedicated controllers. ECC memory. This is VERY difficult to extend to beowulf architectures and isn't easy even on SMP systems -- it is hard to design a computer system that cannot be brought down in its entirety by >>any<< failure of its parts, especially when one of the parts in question is a CPU. The second approach is to engineer the tolerance into the software using the hardware you've got. In many cases this involves checkpointing the code, one way or another (whether the checkpoint goes to memory, to disk, to another system is almost irrelevant). The point is that code checkpointing takes quite a lot of time regardless of the medium compared to the work that would be done in that time, and that this time reduces the efficiency of the program. This is really true even if the redundancy is engineered in at the systems level, but there clever engineering can sometimes hide the extra work or do IT in parallel. One then has to examine the cost-benefit equation. It costs you X amount of time to checkpoint at some frequency. During the intervals you are at risk, with some probability. If the probability of failure is low, the intervals that lead to the smallest expected value for the time to completion will be large, often "infinite" (you should just run the damn program and not bother checkpointing, and risk having to run it over again once every five thousand or so times, because checkpointing it even once costs you one part in a thousand in performance). In other cases (for example, when you're going to run a program that takes a year to complete on 1000 nodes at once) the interval may need to be relatively short just to ensure that the problem EVER completes. Confronted with this, for many folks on the list engineering fault-tolerance into their programs is a total waste of time and a cost-benefit loss. One can write fault tolerant parallel software NOW with existing tools, but it is a lot of work and will slow down your job, possibly quite a bit. Failover is usually more interesting to folks who use clusters in different ways than are discussed on the list -- for example corporate parallelized database servers or really big (multisystem) webservers like yahoo. In these cases, the "cost of failure" (even one failure every umpty-ump days) may be so high as to be "unacceptable", even compared to the relatively high costs of fault tolerance. These operations aren't really "beowulfs" although of course they are "close" and their operators/designers would be welcome on the list. At a guess, this is the kind of problem that will -- eventually -- be at least partly addressed by work being done at a number of places. I believe that there is at least one group working on certain core pieces of software that will build beowulf support directly into the kernel, where it can benefit from increases in speed and efficiency and where one can BEGIN to think about issues like fault tolerance at a lower level than program design. This is the kind of thing the "true beowulf" computer science groups think about. rgb Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu
- Previous message: Beowulfs can compete with Supercomputers [was Beowulf: A theorical approach]
- Next message: beowulf apps for bioinformatics
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Beowulf mailing list
