[Beowulf] A bit OT - scientific workstations - recommendations

Thu Mar 9 05:29:34 PST 2006

On Mon, 6 Mar 2006, Roland Krause wrote:

>
>
> --- Douglas Eadline <deadline at clustermonkey.net> wrote:
>
>>>
>>> So Joe's observation is apropos.  You engineer for your own
>> particular
>>> perception of costs of downtime and willingness to accept risks,
>>> INCLUDING the substantial cost of your own time screwing around
>> with
>>> things.
>>
>> Sure, my definition of screwing around with things is putting it
>> in a box and sending it to the vendor for repair.
>>
>
> This is not how things have worked in my experience. My experience so
> far is: First you call the vendor, you spend an hour on the phone
> rebooting the machine, checking the BIOS, explaining your problems, bla
> bla... Then, maybe, you get an RMA. Most of the time though the vendor
> will want to send you a replacement part that you are supposed to put
> in by yourself. Btw., DELL is one of the worst offenders I have ever
> dealt with in this respect.

Yah, it is a rare hardware problem that doesn't eat an hour FTE, and
many of them will eat 3-4 hours FTE, per occurrence, WITH service of one
sort or another.

With a small cluster and reasonably reliable hardware this is acceptable
or at least survivable.  With a large cluster -- hundreds to thousands of
nodes -- you can get to the point where the average rate of hardware
failures approaches one per day, taking a significant fraction of one
FTE just to deal with the service calls.
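
To put rough numbers on that, here is a back-of-envelope sketch in
Python.  The per-node failure rate (0.3 incidents per node per year) and
the 2000-hour work year are assumptions picked for illustration, not
measurements; the three hours per incident sits inside the 1-4 hour
range quoted above.

```python
def service_load(nodes, failures_per_node_per_year, hours_per_incident,
                 fte_hours_per_year=2000.0):
    """Return (failures per day, fraction of one FTE) for a cluster."""
    failures_per_year = nodes * failures_per_node_per_year
    failures_per_day = failures_per_year / 365.0
    fte_fraction = failures_per_year * hours_per_incident / fte_hours_per_year
    return failures_per_day, fte_fraction


# ASSUMPTION: 0.3 hardware incidents per node per year (hypothetical).
for n in (16, 100, 1000):
    per_day, fte = service_load(n, 0.3, 3.0)
    print(f"{n:5d} nodes: {per_day:.2f} failures/day, {fte:.2f} FTE")
```

The load scales linearly with node count, so whatever rate you actually
observe on a small cluster can be plugged in to estimate a big one.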

Alas, those failures are NOT necessarily uniformly distributed in time
(or even distributed as a simple Poisson process).  There is
significant clustering (bunching) of the parts that fail, the times they
fail, and the proximate causes of failure (e.g. a heating or electrical
bobble, a particularly "hot" job, a power supply, a bad CPU cooling fan,
a defective motherboard capacitor that blows and spews oil all over the
side of your chassis in a puff of smoke).  A couple or three service
calls a week, smoothly distributed, are annoying but manageable by a
single FTE and still leave them with time to manage the systems, help
users, work on project software, and so on.  Fifty or a hundred service
calls in a week can mean not getting anything done for an extended
period of time -- the cost in loss of productivity is huge.
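
One way to see that the bad weeks are not just Poisson bad luck is to
ask how likely a fifty-call week would be if failures really were
independent.  The sketch below assumes a mean of three calls a week
(the "couple or three" above) and sums the Poisson tail directly; the
function name and the number of terms summed are my own choices.

```python
import math


def poisson_tail(mean, k, terms=500):
    """P(N >= k) for a Poisson count N with the given mean.

    The tail is summed directly (rather than as 1 - P(N < k)) so that
    very small probabilities are not lost to rounding.
    """
    # First term P(N = k), computed in log space to avoid overflow.
    term = math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))
    total = 0.0
    for i in range(k, k + terms):
        total += term
        term *= mean / (i + 1)   # P(N = i + 1) from P(N = i)
    return total


# ASSUMPTION: mean of 3 service calls per week, independent failures.
for calls in (5, 10, 50):
    print(f"P(at least {calls} calls in a week) = {poisson_tail(3.0, calls):.3g}")
```

Under that assumption the fifty-call week is effectively impossible, so
when one shows up it is telling you the failures share a common cause.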

Self-service is like this, only multiply all FTE time requirements by
oh, four or so and add out-of-pocket expenses for parts and the cost of
a decent bench with diagnostic hardware and tools (not a bad thing to
have in any event but essential if you're even going to HELP maintain
hardware). Then a single hardware failure is something like:

   * box dies, cause unknown

   * derack or deshelf box, put it on your bench, hook it up to local
monitor/network etc.  Open it up.  Maybe partially disassemble it to
test parts individually in your handy "this box definitely works" unit,
one at a time.

   * debug/diagnose cause of failure.  Sometimes easy (turn it on, hear
CPU fan grinding away or observe that CPU fan doesn't spin up).
Sometimes difficult -- running a memory tester for 48 hours to turn up a
handful of hard errors that are rare enough to let the system boot and
run for a week but common enough to corrupt computations and eventually
the kernel and cause a system crash.

   * acquire replacement parts and/or pull them from a shelf of
replacements and pop them in.

   * retest system to validate reliable operation.  If it fails, return
to debug/diagnose step and loop until...

   * it succeeds, box is all happy now, rerack it and return it to
service.

Timewise, say 15 minutes for deracking, anywhere from five minutes to
fifty hours of debugging per loop pass (sorry, but that's just the way
it is, especially for a rare memory failure or a thermally mediated
failure), anywhere from five to 30 minutes to actually replace hardware
PLUS the time required to acquire the replacement, anywhere from 30
minutes to hours for validation, and 15 minutes to rerack.  Add it up
and it is maybe an hour to replace a CPU fan from a stock already on the
shelf.
Maybe a day to find something complex or to discover and fix multiple
failures (not uncommon after a period of overheating).  Several days (or
at least multiple hours spread over several days) if the problem is e.g.
EITHER the CPU OR memory OR the motherboard itself and you don't have
replacements or a good testbed system set up.
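
The arithmetic is easy to sketch.  The step times below are the ones
quoted above; the 240-minute ceiling on validation is an assumption
(the estimate only says "hours"), and `acquire` is a hypothetical
parameter for time spent waiting on parts.

```python
# Minutes per step as (best case, worst case), from the estimates above.
DERACK = (15, 15)
DEBUG = (5, 50 * 60)     # five minutes to fifty hours, per loop pass
REPLACE = (5, 30)        # excludes time spent acquiring the part
RERACK = (15, 15)


def repair_minutes(passes=1, validate=(30, 240), acquire=0):
    """Return (best, worst) total minutes to fix one failed box.

    ASSUMPTION: the 240-minute validation ceiling is hypothetical.
    Each loop pass is one round of debug, replace, and validate.
    """
    per_pass = [DEBUG[i] + REPLACE[i] + validate[i] for i in (0, 1)]
    return tuple(DERACK[i] + passes * per_pass[i] + RERACK[i] + acquire
                 for i in (0, 1))


for n in (1, 3):
    best, worst = repair_minutes(passes=n)
    print(f"{n} pass(es): {best / 60:.1f} h best, {worst / 60:.1f} h worst")
```

The best case for a single pass comes out a bit over an hour, which
matches the CPU fan example; the worst case is dominated entirely by
the debugging step.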

I've been perfectly happy building my own systems and self-maintaining
them at home and for small clusters -- 16 nodes, say -- at work.  I've
tried extending this model to larger clusters -- ~100 nodes -- and had
multisystem failure experiences that are the equivalent of being shocked
repeatedly by one of those dog training collars.  "Pain" is somehow
inadequate to describe the loss of productivity, the loss of personal
time, the out-of-pocket expense, the anger, the desperation.

Hence my simple rules for buying pro-scale cluster nodes.

   a) Get high quality hardware from a reputable vendor that will work
with you to validate Linux functionality and ensure long-term
replacement part availability.

   b) Get a service contract from said vendor, per node.  Ideally onsite,
as just deracking, boxing and returning, receiving, testing and
reracking an RMA node is hours of time.  Letting somebody into your
cluster room and pointing them at your bench and the downed node and
walking away is hours of THEIR time, minutes of yours.

This costs you a known, fixed fraction of productivity in the form of
more expensive nodes and hence fewer of them.  It provides "insurance"
against the potentially HUGE losses of productivity that can occur in
the case of multisystem failure or a "lemon part", and it in any event
limits the FTE time required per failure to resolve them.  A tradeoff,
but one that I think is well worth it for midscale clusters and
ABSOLUTELY ESSENTIAL for really large clusters.  Unless, of course, you
have so large a cluster (and budget) that assigning a whole FTE admin
who does NOTHING but hardware maintenance from a local warehouse of
spare parts is cost effective...
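
For anyone who wants to run the tradeoff for their own budget, here is
a minimal sketch.  Every number in the example call (budget, node
price, a 15% contract premium, 0.3 failures per node per year, two
admin hours per failure with service) is hypothetical; only the
factor of four for self-service comes from the estimate above.

```python
def compare(budget, node_price, contract_premium, failures_per_node_per_year,
            hours_with_service, self_service_multiplier=4.0,
            fte_hours_per_year=2000.0):
    """Return {option: (nodes bought, FTE fraction spent on repairs)}."""
    options = (
        ("contract", node_price * (1.0 + contract_premium),
         hours_with_service),
        ("self-service", node_price,
         hours_with_service * self_service_multiplier),
    )
    results = {}
    for name, price, hours in options:
        nodes = int(budget // price)          # whole nodes only
        failures = nodes * failures_per_node_per_year
        results[name] = (nodes, failures * hours / fte_hours_per_year)
    return results


# ASSUMPTION: all figures in this call are made up for illustration.
for name, (nodes, fte) in compare(300000, 3000, 0.15, 0.3, 2.0).items():
    print(f"{name:12s}: {nodes} nodes, {fte:.3f} FTE on repairs")
```

Note that this only compares the AVERAGE case and ignores the cost of
self-service parts.  The real argument for the contract is the bad
week, which an average like this will not show.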

    rgb

>
> Roland

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu