[Beowulf] A bit OT - scientific workstations - recommendations
Robert G. Brown rgb at phy.duke.edu
Thu Mar 9 05:29:34 PST 2006
On Mon, 6 Mar 2006, Roland Krause wrote:

> --- Douglas Eadline <deadline at clustermonkey.net> wrote:
>
>>> So Joe's observation is apropos.  You engineer for your own particular
>>> perception of costs of downtime and willingness to accept risks,
>>> INCLUDING the substantial cost of your own time screwing around with
>>> things.
>>
>> Sure, my definition of screwing around with things is putting it
>> in a box and sending it to the vendor for repair.
>
> This is not how things have worked in my experience. My experience so
> far is: First you call the vendor, you spend an hour on the phone
> rebooting the machine, checking the BIOS, explaining your problems, bla
> bla... Then, maybe, you get a RMA. Most of the time though the vendor
> will want to send you a replacement part that you are supposed to put
> in by yourself. Btw., DELL is one of the worst offenders I have ever
> dealt with in this respect.

Yah, it is a rare hardware problem that doesn't eat an hour FTE, and
many of them will eat 3-4 hours FTE, per occurrence, WITH service of one
sort or another.  With a small cluster and reasonably reliable hardware
this is acceptable or at least survivable.  With a large cluster --
100's to 1000's of nodes -- you can get to the point where the average
rate of hardware failures approaches one per day, taking a significant
fraction of one FTE just to deal with the service calls.

Alas, those failures are NOT necessarily uniformly distributed in time
(or even distributed by a straight Poissonian process).  There is
significant clustering (bunching) of the parts that fail, the times they
fail, and the proximate causes of failure (e.g. a heating or electrical
bobble, a particularly "hot" job, a power supply, a bad CPU cooling fan,
a defective motherboard capacitor that blows and spews oil all over the
side of your chassis in a puff of smoke).
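
To make the rate and bunching argument above concrete, here is a minimal
back-of-envelope sketch.  Everything numeric in it is an assumption made
for illustration, not a measurement: the 36.5% annual per-node incident
rate is picked only because it gives round numbers, and the "bunched"
model (incidents arrive as a Poisson process, each one taking out a
fixed batch of nodes) is just one simple way to get clustering.  The
function names are hypothetical.

```python
def expected_failures_per_day(nodes, annual_failure_rate):
    """Average hardware failures per day across the whole cluster.

    annual_failure_rate is the assumed fraction of nodes that suffer
    some hardware failure in a year (illustrative, not measured).
    """
    return nodes * annual_failure_rate / 365.0


def weekly_failure_stats(failures_per_week, nodes_per_incident=1):
    """Mean and variance of the weekly failure count.

    Assumed model: incidents arrive as a Poisson process and every
    incident takes out exactly nodes_per_incident nodes.  With
    nodes_per_incident=1 this is a plain Poisson process (variance
    equals the mean); larger batches keep the same mean but inflate
    the variance by a factor of nodes_per_incident.
    """
    incidents_per_week = failures_per_week / nodes_per_incident
    mean = incidents_per_week * nodes_per_incident
    variance = incidents_per_week * nodes_per_incident ** 2
    return mean, variance


if __name__ == "__main__":
    # 1000 nodes at the assumed rate: roughly one failure per day.
    print(expected_failures_per_day(1000, 0.365))
    # Same weekly average, independent failures vs. batches of seven.
    print(weekly_failure_stats(7.0, nodes_per_incident=1))
    print(weekly_failure_stats(7.0, nodes_per_incident=7))
```

Under these assumptions both cases average seven failures a week, but
the bunched case has seven times the variance: quiet stretches
punctuated by weeks where a whole batch of nodes goes down at once,
which is what makes the load so hard to staff for.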

A couple or three service calls a week, smoothly distributed, are
annoying but manageable by a single FTE and still leave them with time
to manage the systems, help users, work on project software, and so on.
Fifty or a hundred service calls in a week can mean not getting anything
done for an extended period of time -- the cost in loss of productivity
is huge.

Self-service is like this, only multiply all FTE time requirements by
oh, four or so and add out-of-pocket expenses for parts and the cost of
a decent bench with diagnostic hardware and tools (not a bad thing to
have in any event but essential if you're even going to HELP maintain
hardware).  Then a single hardware failure is something like:

  * box dies, cause unknown

  * derack or deshelf box, put it on your bench, hook it up to local
    monitor/network etc.  Open it up.  Maybe partially disassemble it to
    test parts individually in your handy "this box definitely works"
    unit, one at a time.

  * debug/diagnose cause of failure.  Sometimes easy (turn it on, hear
    CPU fan grinding away or observe that CPU fan doesn't spin up).
    Sometimes difficult -- running a memory tester for 48 hours to turn
    up a handful of hard errors that are rare enough to let the system
    boot and run for a week but common enough to corrupt computations
    and eventually the kernel and cause a system crash.

  * acquire replacement parts and/or pull them from a shelf of
    replacements and pop them in.

  * retest system to validate reliable operation.  If it fails, return
    to debug/diagnose step and loop until...

  * it succeeds, box is all happy now, rerack it and return it to
    service.
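
As a rough sanity check on the service-call load and the self-service
multiplier discussed before the step list, here is a small sketch of the
weekly arithmetic.  The "four or so" multiplier and the call counts come
from the text; the two hours per call (somewhere between the one and 3-4
hours quoted for serviced hardware) and the 40-hour week are assumptions
made only for illustration, and the names are hypothetical.

```python
SELF_SERVICE_MULTIPLIER = 4.0  # the "oh, four or so" factor from the text
HOURS_PER_FTE_WEEK = 40.0      # assumed length of one admin's work week


def weekly_fte_hours(calls_per_week, hours_per_call, self_service=False):
    """Admin hours per week spent on hardware failures.

    hours_per_call is the time per failure WITH vendor service; the
    self-service case scales it by SELF_SERVICE_MULTIPLIER.
    """
    hours = calls_per_week * hours_per_call
    if self_service:
        hours *= SELF_SERVICE_MULTIPLIER
    return hours


if __name__ == "__main__":
    # Hypothetical scenarios: a quiet week and a bad week, at an
    # assumed two hours per serviced call.
    for calls in (3, 50):
        serviced = weekly_fte_hours(calls, 2.0)
        own = weekly_fte_hours(calls, 2.0, self_service=True)
        print(calls, "calls/week:",
              serviced / HOURS_PER_FTE_WEEK, "FTE with service,",
              own / HOURS_PER_FTE_WEEK, "FTE self-service")
```

With these assumed inputs, three calls a week cost 6 hours with service
and 24 hours self-serviced, while fifty calls cost 100 hours even with
service, i.e. more than one admin has in a week.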

Timewise, say 15 minutes for deracking, anywhere from five minutes to
fifty hours (sorry, but that's just the way it is, especially for a rare
memory failure or a thermally mediated failure) debugging per loop pass,
anywhere from five minutes to 30 minutes to actually replace hardware
PLUS the time required to acquire the replacement, anywhere from 30
minutes to hours for validation, 15 minutes to rerack.

Add it up and it is maybe an hour to replace a CPU fan from a stock
already on the shelf.  Maybe a day to find something complex or to
discover and fix multiple failures (not uncommon after a period of
overheating).  Several days (or at least multiple hours spread over
several days) if the problem is e.g. EITHER the CPU OR memory OR the
motherboard itself and you don't have replacements or a good testbed
system set up.

I've been perfectly happy building my own systems and self-maintaining
them at home and for small clusters -- 16 nodes, say -- at work.  I've
tried extending this model to larger clusters -- ~100 nodes -- and had
multisystem failure experiences that are the equivalent of being shocked
repeatedly by one of those dog training collars.  "Pain" is somehow
inadequate to describe the loss of productivity, the loss of personal
time, the out-of-pocket expense, the anger, the desperation.

Hence my simple rules for buying pro-scale cluster nodes.

  a) Get high quality hardware from a reputable vendor that will work
     with you to validate Linux functionality and ensure long-term
     replacement part availability.

  b) Get a service contract from said vendor, per node.  Ideally onsite,
     as just deracking, boxing and returning, receiving, testing and
     reracking an RMA node is hours of time.  Letting somebody into your
     cluster room and pointing them at your bench and the downed node
     and walking away is hours of THEIR time, minutes of yours.

This costs you a known, fixed fraction of productivity in the form of
more expensive nodes and hence fewer of them.
It provides "insurance" against the potentially HUGE losses of
productivity that can occur in the case of multisystem failure or a
"lemon part", and it in any event limits the FTE time required to
resolve each failure.  A tradeoff, but one that I think is well worth it
for midscale clusters and ABSOLUTELY ESSENTIAL for really large
clusters.

Unless, of course, you have so large a cluster (and budget) that
assigning a whole FTE admin who does NOTHING but hardware maintenance
from a local warehouse of spare parts is cost effective...

   rgb

>
> Roland
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>

--
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
