disadvantages of a linux cluster

Jim Lux James.P.Lux at jpl.nasa.gov
Wed Nov 6 09:58:09 PST 2002


Reliability is just the probability of "not having a failure". It needs 
a time span associated with it (is it five-nines reliability over a time 
span of 5 minutes?).

The real number you are seeking is "availability", which is a combination 
of mean time to failure (MTTF) and mean time to repair (MTTR).
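
As a rough illustration of how those two quantities combine, here is a minimal sketch using the usual steady-state definition, availability = MTTF / (MTTF + MTTR), where MTTF is mean time to failure and MTTR is mean time to repair. The function names and the example figures are assumptions for illustration, not measurements from any cluster in this thread.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Expected downtime per year implied by a given availability."""
    return (1.0 - avail) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Hypothetical node: fails about every 1000 hours, takes 10 hours to repair.
    a = availability(1000.0, 10.0)
    print(f"availability = {a:.5f}")
    print(f"downtime     = {downtime_minutes_per_year(a):.0f} minutes/year")
    # "Five nines" leaves only a few minutes of downtime per year.
    print(f"five nines   = {downtime_minutes_per_year(0.99999):.3f} minutes/year")
```

Note that a long MTTF does not help much if the repair time is also long; the ratio is what matters.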

And the comments about parallel vs serial faults are well taken. What's 
defined as a failure? The job dying? The job running 1/256th slower? Any 
one component failing? A software reboot required? A three-finger salute to 
get "task manager" up on a node?
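
How much the definition matters can be sketched with two extreme cases, assuming (unrealistically) that node failures are independent and every node has the same availability. In the "serial" case the cluster counts as up only if every node is up; in the "parallel" case it counts as up if at least one node is up. The function names are hypothetical.

```python
def all_nodes_up(per_node_availability: float, nodes: int) -> float:
    """Serial view: the cluster is 'up' only when every node is up."""
    return per_node_availability ** nodes


def at_least_one_node_up(per_node_availability: float, nodes: int) -> float:
    """Parallel view: the cluster is 'up' when any single node is up."""
    return 1.0 - (1.0 - per_node_availability) ** nodes


if __name__ == "__main__":
    a, n = 0.99999, 256
    # Even with five-nines nodes, requiring all 256 at once costs
    # more than two nines at the cluster level.
    print(f"all {n} nodes up:      {all_nodes_up(a, n):.5f}")
    # Under the parallel definition, even mediocre nodes look perfect.
    print(f"at least one of {n} up: {at_least_one_node_up(0.9, n):.5f}")
```

The same hardware can therefore be reported as roughly 0.997 or as essentially 1.0, depending only on which definition of "failure" is chosen.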



At 09:18 AM 11/6/2002 -0500, Robert G. Brown wrote:
>On Tue, 5 Nov 2002, Mark Hahn wrote:
>
> > > 256-processor Intel clusters (home grown apps). We run in parallel with
> > > MPI Pro and Cluster Controller and Windows 2000. Reliability is 5-nines;
> > > manageability tools have helped us to reduce systems administration
> > > costs/staff.
> >
> > so what would be the list price of that software?  do you have any
> > data on how reliability would compare with a linux approach?
> > also, .99999 is impressive, only 5 minutes a year; how long have
> > you had the cluster?  is that .99999 counted for all nodes,
> > or do you mean "at least some nodes worked for .99999 of the time"?
> >
> > if you really mean that the sum of all downtime (across all 256 nodes)
> > is 5 minutes/year, that's truly remarkable!
>
>I agree.  In fact, hardware alone is a lot less reliable than that.
>You've been amazingly lucky.  Even with Dell hardware we've never gone a
>year without some sort of hardware failure that involved a day or so of
>downtime (or expensive onsite service contracts and/or lots of spare
>parts sitting around), and one day contains 1440 minutes, or more than
>five minutes per node for 256 nodes.  Just diagnosing a failed part
>(like a bad memory DIMM or crashed disk or burned motherboard) usually
>takes a few hours.  So you've either really got (effectively) 258
>systems with a couple of them functioning as more-or-less-hot spares or
>have had phenomenally good luck.
>
>If the latter, you might try computing your uptime including the hot
>spares.
>
>    rgb
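
The arithmetic in the quoted message can be checked with a short sketch. It assumes downtime is counted in node-minutes and averaged over all 256 nodes, which is only one of the possible accounting schemes raised in the thread; the names below are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60                  # 1440
MINUTES_PER_YEAR = 365 * MINUTES_PER_DAY   # 525,600


def per_node_downtime_minutes(node_minutes_down: float, nodes: int) -> float:
    """Average downtime per node when total downtime is spread over the fleet."""
    return node_minutes_down / nodes


def fleet_availability(node_minutes_down: float, nodes: int) -> float:
    """Fraction of all node-minutes in a year during which nodes were up."""
    return 1.0 - node_minutes_down / (nodes * MINUTES_PER_YEAR)


if __name__ == "__main__":
    nodes = 256
    one_day = MINUTES_PER_DAY  # a single node down for a single day
    budget = (1.0 - 0.99999) * MINUTES_PER_YEAR
    print(f"five-nines budget per node: {budget:.3f} minutes/year")
    print(f"one node-day over {nodes} nodes: "
          f"{per_node_downtime_minutes(one_day, nodes):.3f} minutes/node")
    print(f"fleet availability: {fleet_availability(one_day, nodes):.7f}")
```

One node being down for one day works out to 5.625 minutes per node across 256 nodes, which already exceeds the roughly 5.26 minutes per year that five nines allows.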




