Uptime data/studies/anecdotes ... ?


Richard Walsh rbw at ahpcrc.org
Tue Apr 2 10:24:22 PST 2002


On Tue, 2 Apr 2002 10:15:00 Roger Smith wrote:

>We currently run an average of about 75% utilization on our 586 processor
>(293 node)  cluster.  We probably have about one node per week crash and
>hang for various reasons.
>
>We have occasional problems with memory leaks or PBS hangups which require
>large scale reboots of the cluster. (Actually, PBS just died as I'm typing
>this, but our pbs heartbeat script should restart it automatically in a
>few minutes).  I'd say we have to do a full reboot of the cluster about
>every 3-4 months.
>
>For a bunch of PC hardware running a free OS, this seems like a pretty
>good number to me.  It's not in the same class as our Sun servers (nor
>even our SGIs!), but then, none of those systems are this large, either.

Thanks for the estimate.  Do you use Scyld or another pseudo-single-system-
image tool? I assume that 75% is a steady-state number ... how long did
it take your group to reach that state?  If a full reboot is required
only every 3-4 months, is single-node failure your main source of
cycle loss? Or do other things, like inefficient scheduling and the lack of
checkpoint/restart, matter as well?

75% does seem like a reasonably good number.
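
To put the quoted figures in perspective, here is a rough back-of-the-envelope
sketch in Python. The node count and crash rate come from the message above;
the 24 hours of downtime per crashed node is purely an assumed placeholder, and
the function names are made up for illustration. It counts only the crashed
node's own downtime, not the work lost by parallel jobs that were running on it.

```python
# Figures taken from the quoted message.
NODES = 293                 # "586 processor (293 node) cluster"
CRASHES_PER_WEEK = 1.0      # "about one node per week crash and hang"
HOURS_PER_WEEK = 7 * 24


def node_mtbf_weeks(nodes, crashes_per_week):
    """Implied mean time between failures for a single node, in weeks."""
    return nodes / crashes_per_week


def downtime_fraction(nodes, crashes_per_week, hours_down_per_crash):
    """Fraction of all node-hours lost while crashed nodes sit idle."""
    lost_hours = crashes_per_week * hours_down_per_crash
    available_hours = nodes * HOURS_PER_WEEK
    return lost_hours / available_hours


mtbf = node_mtbf_weeks(NODES, CRASHES_PER_WEEK)
# ASSUMPTION: a crashed node stays out of service for 24 hours.
loss = downtime_fraction(NODES, CRASHES_PER_WEEK, hours_down_per_crash=24.0)

print(f"implied per-node MTBF: {mtbf:.0f} weeks ({mtbf / 52:.1f} years)")
print(f"node-hours lost to crashed nodes: {loss:.4%}")
```

Under that assumed repair time, the nodes' own downtime comes out well under
0.1% of available node-hours, so the cost of single-node failures would depend
mostly on how much running work each crash throws away.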

rbw



