How do you keep clusters running....
Velocet math at velocet.ca
Fri Apr 5 09:59:56 PST 2002
- Previous message: How do you keep clusters running....
- Next message: What could be the performance of my cluster
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Thu, Apr 04, 2002 at 10:12:54AM -0300, Leandro Tavares Carneiro's all...
> We have here an beowulf cluster with 64 production nodes and 128
> processors, and we have some problems like you, about fans.
> Here, our cluster hardware is very cheap, using motherboards and cases
> founds easily in the local market, and the problems is critical.
> We have 5 spare nodes, and only 3 of that are ready to work. All our
[..]
> I think this kind of problem is inevitable with cheap PC parts, and can
> be lower with high-quality (and price) parts. We are making an study to
> by a new cluster, for another application and we call Compaq and IBM to
> see what they have in hardware and software, with the hope of a future
> with less problems...

You can always employ the 'maximum tolerable failure rate' concept and buy
for that rate. I find that, in pricing equipment, there is a definite
nonlinear (exponential?) relationship between MTBF and price. For a failure
rate that's 3-5 times higher you can spend up to 40% less (or better) on
equipment. This isn't a solid number, but it feels within the ballpark to me
based on what I've priced out before on clusters. Others may dispute this,
but I am talking about buying pre-assembled Dell 2U rackmount servers vs a
bunch of boards, CPUs and RAM you slap together yourself.

Using this concept, and setting your maximum tolerable failure rate at a
specific level that suits your needs, e.g. 1 node per month, coupled with an
aggressive RMA schedule with a good vendor, you can get the best
price/performance out of a cluster. If you can withstand, using my example,
a 3-5 times higher failure rate which ends up being 1 node per month, you
end up with 40% more gear.

If you require 100% of all nodes present to be in one mesh involved in
parallel calculations, and a single node failure is catastrophic to the
entire job running since startup, then it's obviously not worth it if your
job runtime is similar to the interval between failures (1 month).
A failure rate of 1 node/5 months would work far better in that case, as the
average failure would lose you only 10% of the work you do in 5 months,
whereas with 40% more equipment and 5x the failure rate you may lose most of
your work. (Note I am not considering that your jobs may instead run in
[1 month / 1.4] due to the speedup from more gear - that is, in ~70% of the
time (~3 weeks) - and therefore have a higher chance of finishing in the
1 node/mo failure-rate environment.)

However, if your jobs run on all nodes for only a day, then a failure of a
single node once per month costs you, on average, half a day of lost work
per month. For this concession you get 40% more equipment (possibly meaning
40% more processing power, depending on your application).

You also need to factor in how much personal time you have to deal with
RMAing and swapping equipment. A lack of extra time may well make this kind
of model impossible. That notwithstanding, the cost of extra time can be
easily factored into the equation (and knowledgeable work-study undergrads
can be a REALLY cheap alternative here :)

Of course with 40% more power, you may configure two sub-clusters, each with
70% of the power of the original HA design (HA = high availability ~ higher
price). If this fits your needs, a failure of a single node once per month,
on jobs averaging a day in length, will net you the equivalent loss of a
quarter-day of total possible work. The more you isolate sections of the
cluster from each other, the less you will lose when a failure occurs. If
you can manually segment your jobs to run one per node and still achieve
near 100% (or more?) of possible capacity vs a more parallelized system,
then a single node failure is inconsequential.

Considering the number and types of failures discussed here, there is
obviously no guarantee that a certain type of cluster setup will save you
from having massive problems.
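
The lost-work figures above (10% over 5 months, half a day per month, a
quarter-day with two sub-clusters) all follow from one averaging argument,
which can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: a failure kills the job on the
affected (sub-)cluster, all work since the job started is lost (no
checkpointing), and the failure lands at a uniformly random point in the
job, so half the job is lost on average. The function names and the 30-day
month are hypothetical, and the approximation is only reasonable while jobs
are clearly shorter than the interval between failures.

```python
def expected_loss_per_failure(job_days, cluster_fraction=1.0):
    """Average work lost per failure, in full-cluster days.

    Assumes the failure hits at a uniformly random point in the job and
    that all work since job start is lost (no checkpointing).
    cluster_fraction is the share of the cluster the failed job occupied.
    """
    return 0.5 * job_days * cluster_fraction


def lost_work_fraction(job_days, days_between_failures, cluster_fraction=1.0):
    """Fraction of total cluster capacity lost to failures over time."""
    loss = expected_loss_per_failure(job_days, cluster_fraction)
    return loss / days_between_failures


if __name__ == "__main__":
    MONTH = 30  # days; an assumption made for round numbers

    # One-month jobs on the whole cluster, one failure every 5 months.
    print(lost_work_fraction(MONTH, 5 * MONTH))

    # One-day jobs on the whole cluster, one failure per month.
    print(expected_loss_per_failure(1))

    # Same, but the cluster is split into two independent halves.
    print(expected_loss_per_failure(1, cluster_fraction=0.5))
```

Plugging in your own job length and tolerable failure interval gives a quick
first estimate of how much of the extra 40% of gear you would give back to
failures.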
Being able to plan for downtime and manage the costs associated with it is
also obviously part of the design and operation of the cluster. It's a
seesaw type of balance - if you want more nodes for less money, be prepared
to spend more time fixing them. Of course with any cluster, more nodes of
any type will logically translate into more down/service time - so there
will probably be a nonlinear translation of amount of work when comparing
fewer HA nodes vs more cheaper nodes. Of course by this logic, buying fewer,
bigger nodes would also result in less maintenance work. At some point this
becomes too expensive because you're buying big Suns that are very expensive
per GFLOPS (unless of course, that suits your needs best...).

Another problem that makes this whole situation even more complex is that
many cluster installations are subject to strange pricing/operation cost
models. Various parts may actually lie outside your budget responsibility:

One time costs:
 - design costs (on paper)
 - equipment purchase
 - equipment construction/installation
 - equipment configuration
 - software installation & configuration

Long term/ongoing:
 - software maintenance/reconfiguration
 - upkeep/repair
 - equipment upgrades
 - power costs
 - cooling costs

There are probably subcategories these could be split into as well. The
issue here is that, say in a university, power and cooling may be paid for
by the university, as may manual labour for upkeep and repair. If that is
the case, then getting very power-inefficient but fast CPUs may work well
(AMD Thunderbirds, for e.g. :). If you have to pay for your own power,
cooling and manual labour, then you may well opt for spending more on gear
that is cheaper to run (Athlon XPs) - and at that point may as well go for
HA gear too (depending on the cost model) to save expensive manual labour
(at commercial rates >$50/hr you can quickly rack up a node's cost in a day
of work).
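
To make the "who pays for what" point concrete, here is a minimal Python
sketch that totals only the cost items that fall inside your own budget.
The `CostItem` class, the `budget_cost` function and every dollar figure in
the example are hypothetical placeholders, not numbers from any real
cluster; substitute your own.

```python
from dataclasses import dataclass


@dataclass
class CostItem:
    name: str
    amount: float           # one-time amount, or amount per month if recurring
    recurring: bool         # True for long-term/ongoing costs
    in_budget: bool = True  # False if someone else (e.g. the university) pays


def budget_cost(items, months):
    """Total cost charged to your own budget over the given number of months."""
    total = 0.0
    for item in items:
        if not item.in_budget:
            continue  # paid by someone else, so it does not affect the choice
        total += item.amount * months if item.recurring else item.amount
    return total


if __name__ == "__main__":
    # All figures below are made up purely for illustration.
    items = [
        CostItem("equipment purchase", 50_000, recurring=False),
        CostItem("power and cooling", 400, recurring=True, in_budget=False),
        CostItem("repair labour (2 h/month at $50/h)", 100, recurring=True),
    ]
    print(budget_cost(items, months=36))

    # Same cluster, but now power and cooling come out of your own budget.
    items[1].in_budget = True
    print(budget_cost(items, months=36))
```

Running the comparison for a cheap, power-hungry design and for an HA
design, with the `in_budget` flags set to match your institution, shows
quickly which costs actually drive your decision.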
We have successfully employed the non-HA equipment design in building one of
our clusters - and in fact there are added advantages. We have observed that
most (for various values of 'most' - 50% to 80%?) failures occur within the
first month of usage. Once you start swapping out bad nodes, you have a
falling rate of failure (though the age of components slowly catches up over
a long time period - especially for things with moving parts, such as fans).
With all problems taken together (swapping over NFS included, as these are
diskless nodes) we have about 1 node crash/fail in some way every 2 months.
Of course, since jobs can be checkpointed, and a single node failing doesn't
take down the whole cluster (as jobs are run on subsets of nodes), not much
work is lost overall.

For the increased throughput from more nodes for the money, and including
about 15 minutes of work per month physically messing with the machines
that's directly related to hardware problems and crashes (i.e. unrelated to
the time spent maintaining the cluster as per normal operations), it's been
an overall win on that particular cluster. (We have not had to RMA any
equipment since the start of the 2nd month of operation - under our current
service agreement, an RMA would take 1-3 days and about 20-30 min of labour,
and in the meantime would not significantly impact the cluster's
performance.)

As always, designing your cluster around your needs and limitations is the
biggest win on price/performance. Limitations to this are having very wide
ranges of needs and not having any idea of what capabilities will be
required in the future, along with expensive losses when there's downtime,
and expensive manual labour to get things working again. Barring these kinds
of considerations, commodity equipment with a failure rate that you can deal
with can net noticeable gains - having a planned failure cost related to
that rate will save you from surprises.
No matter what kind of cluster you build you WILL have failures, and
designing to mitigate their impact to the highest possible extent is
obviously good planning.

/kc

> On Wed, 2002-04-03 at 18:04, Cris Rhea wrote:
> >
> > What are folks doing about keeping hardware running on large clusters?
> >
> > Right now, I'm running 10 Racksaver RS-1200's (for a total of 20 nodes)...
> >
> > Sure seems like every week or two, I notice dead fans (each RS-1200
> > has 6 case fans in addition to the 2 CPU fans and 2 power supply fans).
> >
> > My last fan failure was a CPU fan that toasted the CPU and motherboard.
> >
> > How are folks with significantly more nodes than mine dealing with constant
> > maintenance on their nodes? Do you have whole spare nodes sitting around-
> > ready to be installed if something fails, or do you have a pile of
> > spare parts? Did you get the vendor (if you purchased prebuilt systems)
> > to supply a stockpile of warranty parts?
> >
> > One of the problems I'm facing is that every time something croaks,
> > Racksaver is very good about replacing it under warranty, but getting
> > the new parts delivered usually takes several days.
> >
> > For some things like fans, they sent extras for me to keep on-hand.
> >
> > For my last fan/CPU/motherboard failure, the node pair will be
> > down ~5 days waiting for parts.
> >
> > Comments? Thoughts? Ideas?
> >
> > Thanks-
> >
> > --- Cris
> >
> > ----
> > Cristopher J. Rhea                     Mayo Foundation
> > Research Computing Facility            Pavilion 2-25
> > crhea at Mayo.EDU                      Rochester, MN 55905
> > Fax: (507) 266-4486                    (507) 284-0587
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org
> > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
> --
> Leandro Tavares Carneiro
> Support Analyst
> EP-CORP/TIDT/INFI
> Phone: 2534-1427

--
Ken Chase, math at velocet.ca * Velocet Communications Inc. * Toronto, CANADA
