disadvantages of a linux cluster
James.P.Lux at jpl.nasa.gov
Wed Nov 6 09:58:09 PST 2002
Reliability is just the probability of "not having a failure". It needs
some time span associated with it (is it 5 nines reliability over a time
span of 5 minutes?).
The real number you are seeking is "availability", which is a combination
of mean time to failure (MTTF) and mean time to repair (MTTR).
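To make that concrete, here is a rough sketch of the usual steady-state
formula, availability = MTTF / (MTTF + MTTR). The function names and the
example numbers are my own illustration, not figures from anybody's cluster.

```python
# Minutes in a non-leap year; used to turn an availability into downtime.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525600


def availability(mttf_hours, mttr_hours):
    """Steady-state availability: the long-run fraction of time a system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_minutes_per_year(avail):
    """Expected downtime per year implied by a given availability."""
    return (1.0 - avail) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # "Five nines" works out to roughly 5.26 minutes of downtime per year.
    print(f"five nines: {downtime_minutes_per_year(0.99999):.2f} min/year")

    # Hypothetical node: one failure a year (MTTF = 8760 h), one day to repair.
    a = availability(8760, 24)
    print(f"hypothetical node availability: {a:.5f}")
```

Note that a one-day repair once a year already puts a single box well short
of five nines, which is why the repair time matters as much as the failure
rate.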
And the comments about parallel vs serial faults are well taken. What's
defined as a failure? The job dying? The job running 1/256th slower? Any
one component failing? A software reboot required? A three-finger salute to
get "task manager" up on a node?
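How much the definition matters can be sketched numerically. The snippet
below assumes node failures are independent and that every node has the
same availability, which is a simplification; the function names are
hypothetical.

```python
def all_nodes_up(node_avail, n_nodes):
    """'Serial' view: the system counts as up only if every node is up."""
    return node_avail ** n_nodes


def at_least_one_up(node_avail, n_nodes):
    """'Parallel' view: the system counts as up if any single node is up."""
    return 1.0 - (1.0 - node_avail) ** n_nodes


if __name__ == "__main__":
    n = 256
    per_node = 0.99999  # assumed: five nines on each individual node

    # Same hardware, two definitions of "failure", very different answers.
    print(f"all {n} nodes up:     {all_nodes_up(per_node, n):.5f}")
    print(f"at least one node up: {at_least_one_up(per_node, n):.5f}")
```

Under these assumptions, five nines per node gives only about 0.9974 for
"all 256 nodes up", so the quoted figure depends heavily on which
definition is being used.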
At 09:18 AM 11/6/2002 -0500, Robert G. Brown wrote:
>On Tue, 5 Nov 2002, Mark Hahn wrote:
> > > 256-processor Intel clusters (home grown apps). We run in parallel with
> > > MPI Pro and Cluster Controller and Windows 2000. Reliability is 5-nines;
> > > manageability tools have helped us to reduce systems administration
> > > costs/staff.
> > so what would be the list price of that software? do you have any
> > data on how reliability would compare with a linux approach?
> > also, .99999 is impressive, only 5 minutes a year; how long have
> > you had the cluster? is that .99999 counted for all nodes,
> > or do you mean "at least some nodes worked for .99999 of the time"?
> > if you really mean that the sum of all downtime (across all 256 nodes)
> > is 5 minutes/year, that's truly remarkable!
>I agree. In fact, hardware alone is a lot less reliable than that.
>You've been amazingly lucky. Even with Dell hardware we've never gone a
>year without some sort of hardware failure that involved a day or so of
>downtime (or expensive onsite service contracts and/or lots of spare
>parts sitting around), and one day contains 1440 minutes, or more than
>five minutes per node for 256 nodes. Just diagnosing a failed part
>(like a bad memory DIMM or crashed disk or burned motherboard) usually
>takes a few hours. So you've either really got (effectively) 258
>systems with a couple of them functioning as more-or-less-hot spares or
>have had phenomenally good luck.
>If the latter, you might try computing your uptime including the hot