[Beowulf] A bit OT - scientific workstations - recommendations

Sun Mar 5 13:45:44 PST 2006

> On Fri, 3 Mar 2006, Joe Landman wrote:
>
>>
>>
>> Douglas Eadline wrote:
>>>> - 24/7/365 next day on site support
>>>
>>>
>>> Let's consider this idea in light of commodity hardware.
>>> (i.e. why I don't buy service contracts on light bulbs)
>>>
>>> Assuming that you can ship a node back for no cost repair within a
>>> warranty period, the question to ask is how many spare nodes
>>> can I buy for the price of a service contract for all the nodes
>>> in my cluster?
>>
>> We live in the era of disposable computing.
>>
>> If your business case demands that you have no down time, then you
>> engineer around that.  If it demands you minimize costs, you need to
>> adjust your expectations on what you will get for those costs.
>
> Your metaphor is a bit strained, Doug.  Nodes resemble employees
> more often than they resemble light bulbs.  Losing an employee, or
> even two, is rarely fatal to a business, but there are often highly
> nonlinear costs associated with things like an "epidemic" that affects
> your entire workforce.

Allow me to refine my metaphor: if nodes work as advertised, then
they can be treated like light bulbs. IMO, the situation you faced was
due to not getting what you expected. I assume a product should
work as stated. I also know that there is a "real world"
kind of failure rate. I was also assuming a 1 year "fix or repair" warranty.

>
> So, the issue is do you get health insurance for your employees that
> more or less guarantees that EVEN an epidemic will cost you only a day
> or so of downtime and a minimal amount of personal stress, or do you
> plan on "playing doctor" for your employees, one at a time, as they
> become ill, living with whatever downtime that process generates and
> paying the cost of their medicines out of pocket.

Well, I think Joe is right: if you are concerned about an epidemic,
then you need health insurance. However, I believe epidemics
are product defects and if a vendor sold you a node it should work.
Based on your experience, it would seem wise to negotiate a clause that
basically says "should the defect rate for any given part exceed X,
then there needs to be complete replacement at vendor expense." Similar to
the "lemon laws" they have for cars (whoops, I mentioned a car
analogy). Now I admit there is a gray area where you can build your own
epidemic by building your own systems.

>
> [Hey, at least it isn't a car metaphor;-)]
>
> So Joe's observation is apropos.  You engineer for your own particular
> perception of costs of downtime and willingness to accept risks,
> INCLUDING the substantial cost of your own time screwing around with
> things.

Sure, my definition of screwing around with things is putting it
in a box and sending it to the vendor for repair.

>
> Having lived through one "nightmare" cluster where every node had to be
> re-engineered on the fly, where every node had to have their bios
> reflashed to become semistable, where every node had to be "fixed",
> where every node had to have its cooling fans replaced -- I will not
> willingly do that again.  We saved 10% or so on the purchase price, but
> paid it out 10 times over in a mix of direct costs for parts and human
> time.

Then I would say your "light bulbs" were defective.

>
> For a home cluster, a hobby cluster, a small prototyping cluster (<= 8
> nodes) sure, working without a net is reasonable.  For a professional
> grade production cluster you CAN consider doing it and might have an
> entire career where it works for you.  Or you could be OH SO SORRY on
> your very first time when the whole damn thing breaks and is out of its
> paltry 90 day mfrs warranty (assuming that you get the cheapest of
> commodity boxes while seeking to save maximal money:-).

Well, that is the issue. If the whole damn thing breaks, then IMO
you are off the bell curve. With clusters in particular, there
is a reasonable expectation that nodes are independent and the
failure of one or two nodes will not bring down the cluster. If all
the nodes fail for some pathological reason, then, IMO, they never
really worked to begin with.

My point is really that the cost of commodity hardware allows one to
re-evaluate the traditional "service contracts" model developed for
proprietary vendor hardware. Of course, it is always nice to have
someone whose job it is to listen to your problems.
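
The spares-versus-contract question can be put as rough arithmetic. This
is only a sketch: the node count, prices, and failure rate below are
made-up numbers, not figures from any vendor or from this thread, and it
assumes nodes fail independently (no "epidemic").

```python
def spares_for_contract_price(node_count, node_price, contract_price_per_node):
    """Whole spare nodes that the total service-contract budget would buy."""
    budget = node_count * contract_price_per_node
    return int(budget // node_price)


def expected_failures(node_count, annual_failure_rate):
    """Expected node failures per year, assuming independent failures."""
    return node_count * annual_failure_rate


# All values below are hypothetical, chosen only for illustration.
nodes = 64
node_price = 2000.0              # assumed cost of one commodity node
contract_price_per_node = 300.0  # assumed yearly contract cost per node
annual_failure_rate = 0.05       # assumed "real world" failure rate

spares = spares_for_contract_price(nodes, node_price, contract_price_per_node)
failures = expected_failures(nodes, annual_failure_rate)

print("spares the contract money would buy:", spares)
print("expected failures per year:", failures)
```

If the spares the contract money buys comfortably exceed the expected
failures, spares plus a return-for-repair warranty look attractive. The
comparison says nothing about a correlated defect that hits every node.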

Doug