Beowulf: A theorical approach
Robert G. Brown rgb at phy.duke.edu
Thu Jun 22 12:08:10 PDT 2000
- Previous message: Infiband (was RE: Beowulf: A theorical approach)
- Next message: Infiniband (was RE: Beowulf: A theorical approach)
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Thu, 22 Jun 2000, Lyle Bickley wrote:

> Thanks Robert for all your comments, but especially those regarding
> fault tolerance.

You're more than welcome.

> Cost/benefit analysis is a very difficult issue. How many Beowulf runs
> that take days to complete fail? What is the cost? I wish I had a
> better handle on this. It's a LOT easier to understand the cost of the
> NY Stock Exchange going down for 20 minutes than a Beowulf failure after
> three days....

I'm hoping to tackle this in a chapter in the eternal book I'm working on. Part of the answer is objective, and that part can be explained. In fact, it is mathematically described by e.g. game theory or insurance company actuarial statistics -- one is selecting a strategy to optimize some expected return (maximize benefit or minimize cost) based on one's best guess of certain probabilities and cost weights. There are even ways to create a feedback correction cycle and tune to a global optimum based on observed rates of failures and observed costs instead of guesses, if one gets very fancy and it matters.

The other part of the answer, as you note, is subjective. What's the "cost" of a beowulf failure after three days? Probably very little, if you are in the middle of a six month project and it doesn't happen again. On the other hand, if you have a publication deadline in two days and needed just one more hour to complete the three day run that would finish things off in time to write them up...

I worry about the same thing here during the academic year. During the bulk of the semester a server failure in the physics department is an annoyance, but probably isn't "critical". Every semester, though, there is a ten day or so period where a server failure could literally be a disaster -- when I'm writing my final exams (on the computer) and so is everybody else, or evaluating my gradebooks (on the computer) and so is everybody else.
If those go away right before I was going to print out an exam or tally up the grades, the entire academic Universe comes to an ugly end, as students can't be given a final exam in their one and only final exam slot, or a failing grade doesn't get in until after they've graduated. Heads roll. Angry students storm your office carrying torches.

So, we do what we can to guard against this -- keep good backups, architect things so there is a replacement box that could be turned into the primary server in a few hours. This costs some money and time but is worth it.

On the other hand, what if there is a fire? Can't say that our measures are adequate for that. Insurance for that would involve off-site storage, and in fact I tend to do just that and try to keep my entire CVS tree sync'd between home and work so if a (small:-) meteor landed on the physics building tonight (when I wasn't there) my sources and writings and papers and so forth would survive. Even this wouldn't help if there was a hurricane like Fran -- electricity itself went away for more than a week at my house, and my laptop won't run that long and I can't afford an adequate solar recharger...;-)

Backup strategy (the underlying reasoning) is basically the same as failover strategy -- you determine the amount of work you are willing to lose given scenario X and work cost/value Y and take preventative measures matched to that period. You then cross your fingers concerning scenario Z that you can't afford to deal with. After all, even Tandems will go down if they are vaporized in a nuclear blast. Unless perhaps they are failover protected at sites separated by (say) several tens of miles and a lot of EMP protection. Military scenarios probably require failover protection at even this level, but most of the rest of us don't.

A lot of people doing beowulfish calculations do failover protection of sorts without even knowing consciously that that is what they are doing. For example, why is it a "three day run"?
In most cases, one can pick a (scientific) calculation size that will run in an hour, or a day, or a week, or a year (and all would yield interesting results). You pick a size that you can afford and that finishes in a "reasonable" amount of time. Larger sizes wait for Moore's Law to catch up to them. What's reasonable? A size that you're pretty sure will finish before a system is likely to fail, which may be as low as the interval between area thunderstorms in the summer (this was the case at my house before I installed UPS on everything).

In many cases one can do better. For example, it may be possible to do a year's worth of calculation safely by breaking it into chunks completed a day at a time, or a week at a time, without having to really "checkpoint" the code. In Monte Carlo, for example, one can just run a large number of independent simulations and do stats to recombine the results. One even gains from doing this, as the variance of the truly independent runs is an absolutely reliable measure of error in the mean (which isn't generally the case for the variance generated by importance sampling a single Markov chain with internal autocorrelation times, but I digress:-).

I personally try to time things so that chunk completion times are on the order of one day, because I'm always willing to lose a day's worth of compute time as long as it doesn't happen too often. Sometimes I've gone as high as a week. I basically NEVER do three week long runs if there is any way to rearrange things so I don't have to -- systems don't break, linux rarely fails, but somehow "something" (lightning, human error, power fluctuations, somebody tripping over a cord) not infrequently intervenes somewhere within the timeframe of months.
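
The independent-run chunking described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's code: the toy problem (estimating pi), the function names `run_chunk` and `combine`, the chunk count, and the use of one integer seed per chunk to keep the runs independent are all hypothetical choices made for the example.

```python
import math
import random
import statistics

def run_chunk(seed, n_samples=10_000):
    """One independent simulation (hypothetical toy problem: estimate pi)."""
    rng = random.Random(seed)  # assumption: a distinct seed per chunk
    hits = sum(1 for _ in range(n_samples)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * hits / n_samples

def combine(estimates):
    """Mean of the chunk results and the standard error of that mean."""
    mean = statistics.fmean(estimates)
    # Sample standard deviation across independent chunks, divided by
    # sqrt(number of chunks), estimates the error in the combined mean.
    stderr = statistics.stdev(estimates) / math.sqrt(len(estimates))
    return mean, stderr

# Each chunk could be a separate job on a separate node. A crash costs only
# the chunk that was running; finished chunks never need to be recomputed.
results = [run_chunk(seed) for seed in range(16)]
mean, stderr = combine(results)
print(f"estimate = {mean:.4f} +/- {stderr:.4f} from {len(results)} chunks")
```

In a real run each chunk would write its result to disk as it finishes, so the recombination step can be repeated at any time with whatever chunks have completed so far.
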
This very coarse chunking of work is all the "failover checkpointing" that I (or, I suspect, most beowulf folks) do, and it works quite effectively, although I'm sure that it isn't always possible to coarsely chunk like this without writing a lot of nasty code to save a truly restartable checkpoint state...

> > At a guess, this is the kind of problem that will -- eventually -- be at
> > least partly addressed by work being done at a number of places. I
> > believe that there is at least one group working on certain core pieces
> > of software that will build beowulf support directly into the kernel,
> > where it can benefit from increases in speed and efficiency and where
> > one can BEGIN to think about issues like fault tolerance at a lower
> > level than program design. This is the kind of thing the "true beowulf"
> > computer science groups think about.
> >
>
> I have been considering the possibility of a single Tandem like system
> which is TRULY fault tolerant, bringing "true" fault tolerance to an
> entire Beowulf cluster via heartbeats, progress monitoring, process
> checkpoints, etc.
>
> But who would buy such a critter?? Is there really a need?? What
> percentage of the total cost of a Beowulf would be a reasonable cost for
> such a beast??

Well, Tandem systems do sell, of course, so there is a market for this kind of fault tolerance. The military might even need it on a small scale -- a tank might be made more robust if its battle computer was really a fault tolerant beowulf networked to four or five hard sites within the tank. A non-fatal hit might take out one or two nodes, but not the whole thing. Ditto the space program (plagued with failures already and with a very high cost of failure). Financial markets and webservice markets both have a high cost of failure. Something like an EMS computer system supporting a 911 center cannot afford to go down in any dimension, even during a natural or unnatural disaster.
In many of these cases, the people buying the fault tolerance have DEEP pockets and the cost of failure is VERY high. However, their needs are also very, very specific, so one has to basically simultaneously engineer the system and the software to match. The one thing bringing this sort of fault tolerance to beowulfery (at the systems level, with open source components and COTS hardware) would do is significantly lower the cost of the dedicated/custom software development. I think that is the goal of some of the folks working on the problem.

A very interesting subject, I agree. Go for it.

   rgb

Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
