disadvantages of a linux cluster

Wed Nov 6 09:58:09 PST 2002

Reliability is just the probability of "not having a failure". It needs 
a time span associated with it (is that five-nines reliability over a time 
span of 5 minutes?).
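
To put a number on why the time span matters, here is a rough sketch. It 
assumes a constant failure rate (exponential lifetimes), which is my 
simplification and not something anyone in the thread claimed, and the 
10,000 hour MTTF is a made-up figure for illustration only.

```python
import math

def reliability(t_hours, mttf_hours):
    """Probability of no failure during t_hours.

    Assumes a constant failure rate, so R(t) = exp(-t / MTTF).
    """
    return math.exp(-t_hours / mttf_hours)

def span_for_reliability(target, mttf_hours):
    """Longest time span (hours) over which reliability stays >= target."""
    return -mttf_hours * math.log(target)

MTTF_HOURS = 10_000.0        # hypothetical figure, not measured data
FIVE_MINUTES = 5.0 / 60.0    # hours
ONE_YEAR = 365 * 24.0        # hours

r_5min = reliability(FIVE_MINUTES, MTTF_HOURS)
r_year = reliability(ONE_YEAR, MTTF_HOURS)
span = span_for_reliability(0.99999, MTTF_HOURS)

print(f"reliability over 5 minutes: {r_5min:.7f}")
print(f"reliability over one year:  {r_year:.3f}")
print(f"five nines holds for about {span * 60:.1f} minutes")
```

Under those assumptions the same box is "five nines reliable" over a few 
minutes and worse than a coin flip over a year, so the number means 
nothing without the time span.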

The real number you are seeking is "availability", which combines mean 
time to failure (MTTF) and mean time to repair (MTTR): it is the fraction 
of time the system is up, MTTF / (MTTF + MTTR).
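
As a sketch of that definition, using the usual steady-state formula 
MTTF / (MTTF + MTTR). The MTTF and MTTR values below are hypothetical, 
picked only to show the arithmetic; the function names are mine.

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525600

def availability(mttf, mttr):
    """Steady-state availability: fraction of time the system is up.

    mttf and mttr must be in the same units.
    """
    return mttf / (mttf + mttr)

def downtime_minutes_per_year(avail):
    """Expected downtime per year implied by a given availability."""
    return (1.0 - avail) * MINUTES_PER_YEAR

# What "five nines" allows in a year.
five_nines_budget = downtime_minutes_per_year(0.99999)

# Hypothetical node: fails every 10,000 hours, takes 4 hours to fix.
node_avail = availability(10_000.0, 4.0)
node_downtime = downtime_minutes_per_year(node_avail)

print(f"five-nines budget: {five_nines_budget:.2f} minutes/year")
print(f"hypothetical node: {node_avail:.5f} available, "
      f"{node_downtime:.0f} minutes/year down")
```

The five-nines budget comes out a bit over 5 minutes a year, which matches 
the figure quoted below. A single repair that takes a few hours uses up 
many years' worth of that budget.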

And the comments about parallel vs serial faults are well taken. What's 
defined as a failure? The job dying? The job running 1/256th slower? Any 
one component failing? A software reboot being required? A three-finger 
salute to get "task manager" up on a node?
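
How much the definition matters can be sketched as follows, assuming 
nodes fail independently (a simplification of mine; shared power, network 
or cooling would break it). The "serial" view says the cluster is up only 
if all 256 nodes are up; the "parallel" view says it is up if at least one 
node is.

```python
MINUTES_PER_YEAR = 365 * 24 * 60
N_NODES = 256

def all_nodes_up(node_avail, n):
    """Serial view: the cluster counts as up only if every node is up."""
    return node_avail ** n

def any_node_up(node_avail, n):
    """Parallel view: the cluster counts as up if at least one node is up."""
    return 1.0 - (1.0 - node_avail) ** n

def node_avail_needed(cluster_avail, n):
    """Per-node availability needed so that all n nodes are up
    cluster_avail of the time (independent failures assumed)."""
    return cluster_avail ** (1.0 / n)

serial = all_nodes_up(0.99999, N_NODES)
parallel = any_node_up(0.99999, N_NODES)
needed = node_avail_needed(0.99999, N_NODES)
per_node_budget_seconds = (1.0 - needed) * MINUTES_PER_YEAR * 60

print(f"all {N_NODES} nodes up:  {serial:.5f}")
print(f"at least one node up: {parallel:.5f}")
print(f"per-node downtime allowed for five nines on the whole "
      f"cluster: {per_node_budget_seconds:.2f} seconds/year")
```

With five-nines nodes, the "every node up" number drops to between two and 
three nines, while the "some node up" number is indistinguishable from 1. 
Which one is being quoted makes all the difference.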

At 09:18 AM 11/6/2002 -0500, Robert G. Brown wrote:
>On Tue, 5 Nov 2002, Mark Hahn wrote:
>
> > > 256-processor Intel clusters (home grown apps). We run in parallel with
> > > MPI Pro and Cluster Controller and Windows 2000. Reliability is 5-nines;
> > > manageability tools have helped us to reduce systems administration
> > > costs/staff.
> >
> > so what would be the list price of that software?  do you have any
> > data on how reliability would compare with a linux approach?
> > also, .99999 is impressive, only 5 minutes a year; how long have
> > you had the cluster?  is that .99999 counted for all nodes,
> > or do you mean "at least some nodes worked for .99999 of the time"?
> >
> > if you really mean that the sum of all downtime (across all 256 nodes)
> > is 5 minutes/year, that's truly remarkable!
>
>I agree.  In fact, hardware alone is a lot less reliable than that.
>You've been amazingly lucky.  Even with Dell hardware we've never gone a
>year without some sort of hardware failure that involved a day or so of
>downtime (or expensive onsite service contracts and/or lots of spare
>parts sitting around), and one day contains 1440 minutes, or more than
>five minutes per node for 256 nodes.  Just diagnosing a failed part
>(like a bad memory DIMM or crashed disk or burned motherboard) usually
>takes a few hours.  So you've either really got (effectively) 258
>systems with a couple of them functioning as more-or-less-hot spares or
>have had phenomenally good luck.
>
>If the latter, you might try computing your uptime including the hot
>spares.
>
>    rgb