Uptime data/studies/anecdotes ... ?
rbw at ahpcrc.org
Tue Apr 2 10:24:22 PST 2002
On Tue, 2 Apr 2002 10:15:00 Roger Smith wrote:
>We currently run an average of about 75% utilization on our 586 processor
>(293 node) cluster. We probably have about one node per week crash and
>hang for various reasons.
>We have occasional problems with memory leaks or PBS hangups which require
>large scale reboots of the cluster. (Actually, PBS just died as I'm typing
>this, but our pbs heartbeat script should restart it automatically in a
>few minutes). I'd say we have to do a full reboot of the cluster about
>every 3-4 months.
>For a bunch of PC hardware running a free OS, this seems like a pretty
>good number to me. It's not in the same class as our Sun servers (nor
>even our SGIs!), but then, none of those systems are this large, either.
Thanks for the estimate. Do you use SCYLD or another pseudo-single-system-
image tool? I assume that 75% is a steady state number ... how long did
it take your group to reach that state? If a full reboot is required
only every 3-4 months, then is single-node failure your main source of
cycle loss? Or are other things like inefficient scheduling and lack of
check-point/restart, etc. important?
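
A rough back-of-envelope sketch of the question above: how much of the
machine is lost directly to node crashes and full reboots? The node count
(293), the crash rate (about one per week), and the reboot interval (every
3-4 months, taken here as 15 weeks) come from the quoted message. The
downtime per crash and per reboot are assumed values chosen only for
illustration; nobody in this thread has reported them.

```python
# Back-of-envelope estimate of node-hours lost to crashes and full reboots.
# Node count, crash rate, and reboot interval are from the quoted message;
# the downtime figures below are ASSUMPTIONS for illustration only.

HOURS_PER_WEEK = 7 * 24


def lost_fraction(nodes, crashes_per_week, hours_down_per_crash,
                  weeks_between_reboots, hours_down_per_reboot):
    """Return (crash_loss, reboot_loss) as fractions of total node-hours."""
    node_hours_per_week = nodes * HOURS_PER_WEEK
    # Each crash takes a single node offline until it is noticed and fixed.
    crash_loss = crashes_per_week * hours_down_per_crash / node_hours_per_week
    # A full reboot takes every node offline for the duration of the reboot.
    reboot_loss = (nodes * hours_down_per_reboot) / (
        node_hours_per_week * weeks_between_reboots)
    return crash_loss, reboot_loss


crash_loss, reboot_loss = lost_fraction(
    nodes=293,                 # from the quoted message
    crashes_per_week=1,        # from the quoted message
    hours_down_per_crash=24,   # assumed: a crashed node is back within a day
    weeks_between_reboots=15,  # roughly 3-4 months, from the quoted message
    hours_down_per_reboot=4,   # assumed: a full reboot costs half a working day
)
print(f"lost to node crashes: {crash_loss:.4%}")
print(f"lost to full reboots: {reboot_loss:.4%}")
```

Under these assumptions both losses come out well under 1% of the available
node-hours, which is why the scheduling and check-point/restart questions
matter. The sketch counts only the time nodes are down; it does not count
the work thrown away when a running parallel job is killed by a crash,
which could be considerably larger without check-pointing.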
75% does seem like a reasonably good number.