How do you keep clusters running....
Robert G. Brown rgb at phy.duke.edu
Wed Apr 3 15:27:31 PST 2002
On Wed, 3 Apr 2002, Cris Rhea wrote:

> Comments? Thoughts? Ideas?

a) Use onboard sensors (hoping your motherboards have them) to shut nodes down if the CPU temperature exceeds an alarm threshold. That way future fan failures shouldn't cause system failure, just node shutdown.

b) Use the largest cases you can manage given your space requirements. Larger cases have a bit more thermal ballast and can tolerate poor cooling for a bit longer before failing catastrophically. If nothing else, that gives you (or your monitoring software) more time to react.

c) With only ten boxes, it sounds like you're having plain old bad luck, possibly caused by a bad batch of fans. Relax, perhaps your luck will improve ;-)

With all that said, it is still true that maintenance problems scale poorly with the number of nodes. That is one reason (of many) that I prefer not to get nodes from vendors in another state whom I never meet face to face. If your nodes are built by a local vendor (especially one with a decent local parts inventory and service department), it is a bit easier to get good turnaround on node repairs and minimize downtime, especially since a local business rapidly learns that making you happy is more important to its bottom line than making the next twenty or thirty customers who might walk through its door happy.

There is also the usual tradeoff between buying "insurance" (e.g. onsite, 24 hour service contracts) on everything and the number of nodes you can afford. There are plenty of companies that will sell you nodes and guarantee minimal downtime -- for a price. IBM and Dell come to mind, although there are many more. Only you can determine how mission critical it is to keep your nodes up, and what the cost-benefit tradeoffs are between buying fewer nodes (but getting better quality nodes and arranging guarantees of minimal downtime) and buying more nodes (but risking having a node or two down pending repairs from time to time).
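Suggestion (a) above could be sketched roughly as follows. This is only an illustration, not the setup used on any cluster discussed in this thread: it assumes a Linux node that exposes temperatures under /sys/class/thermal (in millidegrees Celsius) and that the script runs with enough privilege to call shutdown. The 70 C threshold, the 10 second polling interval, and the function names are all hypothetical and should be adjusted to your hardware.

```python
import glob
import os
import subprocess
import time

ALARM_C = 70.0      # hypothetical alarm threshold in Celsius; tune per CPU
POLL_SECONDS = 10   # hypothetical polling interval


def read_temps_c(base="/sys/class/thermal"):
    """Return every readable thermal-zone temperature in degrees Celsius."""
    temps = []
    for path in sorted(glob.glob(os.path.join(base, "thermal_zone*", "temp"))):
        try:
            with open(path) as f:
                # The kernel reports millidegrees Celsius in these files.
                temps.append(int(f.read().strip()) / 1000.0)
        except (OSError, ValueError):
            continue  # skip zones that cannot be read or parsed
    return temps


def over_threshold(temps_c, alarm_c=ALARM_C):
    """True if any reading is at or above the alarm threshold."""
    return any(t >= alarm_c for t in temps_c)


def monitor():
    """Poll the sensors and halt the node once a reading hits the alarm."""
    while True:
        if over_threshold(read_temps_c()):
            # Assumes the standard shutdown command is on PATH and we are root.
            subprocess.run(["shutdown", "-h", "now"], check=False)
            return
        time.sleep(POLL_SECONDS)

# To use on a node, call monitor() from a script started at boot.
```

Note that a node with no readable sensors never triggers the shutdown in this sketch, so it is worth checking what read_temps_c() returns on your hardware before relying on it.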
Cost-benefit analysis is at the heart of beowulf engineering, but you have to determine the "values" that enter into the analysis based on your local needs.

   rgb

>
> Thanks-
>
> --- Cris
>
> ----
> Cristopher J. Rhea                 Mayo Foundation
> Research Computing Facility        Pavilion 2-25
> crhea at Mayo.EDU                  Rochester, MN 55905
> Fax: (507) 266-4486                (507) 284-0587
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>

--
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email: rgb at phy.duke.edu
