[Beowulf] Re: Cooling vs HW replacement
Robert G. Brown rgb at phy.duke.edu
Thu Jan 27 11:07:51 PST 2005
- Previous message: [Beowulf] Re: Cooling vs HW replacement
- Next message: [Beowulf] Re: Cooling vs HW replacement
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Thu, 27 Jan 2005, Josip Loncaric wrote:

> Karen Shaeffer wrote:
> > [...]
> >
> > If DDMs were interested in helping customers discriminate based on the
> > actual expected lifetime of drives, they would all publish running infant
> > mortality rates, updated weekly, during the production run of their disk
> > drives. Afterall, this is the one metric the entire organization is focused
> > on during production. But, what they hand out is this MTBF number to
> > prospective customers. A number they pay no attention to internally.
>
> Karen's excellent introduction to the logic of disk drive manufacturing
> (DDM) is well worth reading -- particularly since the same factors drive
> other computer manufacturers: rapid product cycles, insane time
> pressures, thin profit margins, limited opportunity to prevent
> financially ruinous mistakes, etc.

Agreed. Been there, been burned (which is why I keep urging caution about using published numbers from the mfr as a sound basis for engineering without a grain of salt, especially to people with relatively little experience in this arena -- cluster newbies).

> Therefore, deciding which drive model (or other component) to use fits
> under the topic of optimal decision making under uncertainty -- which is
> a standard part of game theory, often used in operations research, etc.
>
> Making rational choices, which can withstand scrutiny even when things
> unexpectedly go wrong, is not just an art. There is theory to build on.

This is also an excellent contribution to the discussion. In particular, I'd urge people building large clusters to consider the benefits of insuring some of the risks, which is what humans generally do when confronted with the same problem in the arena of human affairs. In a large cluster, the economic consequences of a massive component failure (however common or rare that might be) can be devastating to the project, to careers, to productivity.

This is a classic component of game theory applied to real life and is the fundamental raison d'etre for the insurance industry (and why I keep referring to actuarial data). One piece of "insurance" is obviously the base warranty of each component, but this generally protects you only partially, and only from the actual cost of the replacement hardware itself if a component fails. You still take a major hit in productivity and in the opportunity cost of labor diverted to downtime and repair. Whether or not this additional cost is affordable depends to a certain extent on luck and to a certain extent on the "value" of your project.

Speaking from bitter personal experience with the Tyan 2460 and 2566 motherboards (as well as anecdotal experiences with various other system components such as drives, riser cards, cases, case fans, CPUs and CPU fans, OEM AMD Athlon MP fans in particular), things DO break in mass "catastrophic" bursts a lot more often than MTBF numbers or even warranties would lead you to expect. This cost can be quite high. It can drain resources and energy for years (until the hardware is finally aged out and replaced), or require an immediate infusion of much money for immediate replacements, or, in our case (where the replacements were themselves a problem, albeit a lesser one), both.

Practically speaking, hardware "insurance" often means considering extended and/or onsite warranties -- effectively betting someone that your systems will break, at a stake of some percentage of their original cost (generally ballpark 10% for 3 years). Extended service has two valuable purposes. One is that it obviously directly protects you from bearing the brunt of the cost of anything from the normal pattern of bathtub-bottom failures during the normal lifespan up to mass failures or higher than expected normal-lifespan failures during the period that the cluster is expected to be productive.

Other forms of insurance against catastrophic failure (such as fire or theft insurance and surge protection and door locks) exist as well, although they tend to be purchased outside of the engineering/operations loop.

Insurance via extended warranty addresses the paradox of mass failure (one that might "kill" you or at any rate your project). Even though it is often (or even generally) cheaper in terms of expectation value of the total cost to build a DIY cluster and self insure, excessive (unlucky) failures are far more likely to be "fatal" when you self insure.

One major complaint against the HMO industry is that capitation (giving a physician X dollars per head for a group of patients up front while obligating the physician to treat all of that group who get sick) exposes those physicians to the risk of catastrophe in the event that a plague comes along and strikes the group. It is anti-insurance: the passing of risks back to small groups for which the fluctuations can be fatal, rather than assuming the risks spread out over a large group where they are more predictable. It costs you a bit more to insure with a larger group (even with its more predictable risks), but the benefit you gain is that you'll stay "alive" no matter what, if you can afford the insurance itself in the first place.

Practically speaking, since additional cost means fewer nodes, you can choose to definitely get 10% less work done over the lifetime of your project but ensure that you have only a very small chance of getting only 50% or 30% of the work done (or of facing massive out of pocket costs) due to catastrophic failure downtime. If things go well, of course, you lose -- maybe over the same interval you would only have lost 5% of your nodes -- and MOST of the time one expects things to go well, which encourages people to assume the risk and gamble that things will go well.
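
To put rough numbers on that tradeoff, here is a minimal Python sketch. It is only an illustration under stated assumptions: the 10%, 5% and 50% figures are the ballpark ones from the paragraph above, while the 5% chance of a catastrophic burst (P_CATASTROPHE) is a made-up number, not actuarial data, and the helper names are hypothetical. It also pretends the warranty removes all downtime risk, which it does not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    probability: float    # chance this scenario happens over the cluster's life
    work_fraction: float  # fraction of the ideal (no-failure) work completed


def expected_work(outcomes):
    """Probability-weighted average of the work fraction."""
    total_p = sum(o.probability for o in outcomes)
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("scenario probabilities must sum to 1")
    return sum(o.probability * o.work_fraction for o in outcomes)


def worst_case(outcomes):
    """Smallest work fraction among scenarios that can actually occur."""
    return min(o.work_fraction for o in outcomes if o.probability > 0)


def break_even_probability(good, bad, insured):
    """Catastrophe probability at which self-insuring and buying the
    warranty give the same expected work."""
    return (good - insured) / (good - bad)


# ASSUMPTION: made-up chance of a mass-failure burst; not measured data.
P_CATASTROPHE = 0.05

# Self-insured: usually lose ~5% of the work, occasionally get only 50% done.
SELF_INSURED = [
    Outcome(1.0 - P_CATASTROPHE, 0.95),
    Outcome(P_CATASTROPHE, 0.50),
]

# Warranty: ~10% fewer nodes for certain. SIMPLIFICATION: treated as risk-free.
WARRANTY = [Outcome(1.0, 0.90)]

if __name__ == "__main__":
    for name, plan in (("self-insured", SELF_INSURED), ("warranty", WARRANTY)):
        print(f"{name:>12}: expected {expected_work(plan):.4f}, "
              f"worst case {worst_case(plan):.2f}")
    print(f"break-even P(catastrophe): "
          f"{break_even_probability(0.95, 0.50, 0.90):.3f}")
```

With these invented inputs the self-insured cluster does slightly better on average (about 0.93 of the ideal work versus 0.90) but can end up at 0.50, and the two plans only break even on expectation once the chance of a catastrophic burst reaches roughly 11%. The expectation value alone says nothing about whether your project survives the bad branch, which is the point of buying the insurance.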

The second purpose of extended service is that hardware backed by an onsite service contract and a company that assumes much of the risk is more likely not to fail in the first place. That company has a strong incentive to protect >>their<< risk in the venture by passing as much as possible back to the mfrs (even at an additional cost) and to perform additional testing and system engineering. It has no disincentive to uncover "bad" components, whereas for the manufacturer (as Karen points out) by then it is more or less too late to do anything other than sell them off as best you can and take your lumps. The company also typically has some actual clout with the manufacturers and can dicker out deals that further minimize their (and by proxy your) risk, both in terms of getting a premium selection of hardware and of getting better warranty terms per dollar spent.

Deciding your optimum comfort level of risk taking is not easy -- partly it is subjective, partly it can be made objective if you can assign a dollar "value" to your time and the up time of your cluster. Even humans (with their relatively low failure rate during their "prime years") tend to buy insurance during this period because, even if failure rates are low, the consequences of a failure to their family and loved ones are very high.

   rgb

>
> Sincerely,
> Josip
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>

--
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu
