Uptime data/studies/anecdotes ... ?

Rayson Ho raysonlogin at yahoo.com
Tue Apr 2 20:07:19 PST 2002


--- "Roger L. Smith" <roger at ERC.MsState.Edu> wrote:
> We currently run an average of about 75% utilization on our 586
> processor (293 node)  cluster.  We probably have about one node per 
> week crash and hang for various reasons.

The OpenPBS backfilling algorithm is really bad. If you are running
parallel jobs, you should use PBS+Maui.

> We have occasional problems with memory leaks or PBS hangups which
> require large scale reboots of the cluster. (Actually, PBS just died 
> as I'm typing this, but our pbs heartbeat script should restart it 
> automatically in a few minutes).  I'd say we have to do a full reboot
> of the cluster about every 3-4 months.

One bigger problem is (or was; I haven't looked at the PBS code since
last fall) that in each scheduling cycle, the scheduler tries to
contact each MOM in the cluster to get resource information. If one
of the MOMs has died, the scheduler hangs... and then times out and
restarts.
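
To make the failure mode concrete, here is a minimal sketch of the
alternative: query every node in parallel with a per-cycle deadline, so
one dead node cannot stall the whole cycle. This is not PBS code. The
names poll_nodes, query and timeout_s are made up for illustration, and
query stands in for whatever call actually fetches resource information
from a node.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def poll_nodes(nodes, query, timeout_s=2.0):
    """Query all nodes in parallel; return (results, unresponsive).

    `query` is a hypothetical callable taking a node name and returning
    that node's resource info. Nodes that raise or do not answer within
    `timeout_s` are reported as unresponsive instead of blocking.
    """
    # One worker per node so every query starts immediately.
    pool = ThreadPoolExecutor(max_workers=max(1, len(nodes)))
    futures = {pool.submit(query, node): node for node in nodes}

    # Wait for the whole batch, but only up to the deadline.
    done, not_done = wait(futures, timeout=timeout_s)

    results, unresponsive = {}, []
    for fut in done:
        node = futures[fut]
        try:
            results[node] = fut.result()
        except Exception:
            # The query failed outright (e.g. connection refused).
            unresponsive.append(node)

    # Anything still running at the deadline is treated as hung.
    unresponsive.extend(futures[fut] for fut in not_done)

    # Do not block on hung workers; their threads linger until the
    # underlying call returns, so real code should also set socket
    # timeouts inside `query`.
    pool.shutdown(wait=False)
    return results, sorted(unresponsive)
```

The scheduler can then plan the cycle using only the nodes in results
and mark the unresponsive ones as down until they answer again.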

You may try the "Cplant Fault Recovery Patch" and several other patches
if you want to stay with PBS.

> For a bunch of PC hardware running a free OS, this seems like a
> pretty good number to me.  It's not in the same class as our Sun 
> servers (nor even our SGIs!), but then, none of those systems are 
> this large, either.

Another problem (at least in OpenPBS 2.3.12) is that there are some
hard limits defined in the source (like
"#define PBS_ACCT_MAX_RCD 4095" and
"#define PBS_NET_MAX_CONNECTIONS 256"), which may not work in large
clusters.

If you want something free, then you may try SGE. It scales quite
nicely (SGE improved a lot in 5.3), it's open source, and integrates
with Maui.

I like SGE better than OpenPBS: at least when one (or more?) of your
nodes dies, the cluster continues to operate, and SGE even re-runs the
job for you. Another feature is the shadow master, which restarts the
master daemon on another machine if your master node dies.
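
The general idea behind a shadow master can be sketched as a heartbeat
check: a standby host watches how old the master's last sign of life
is, and takes over once it is too stale. The code below is only an
illustration of that idea, not SGE's implementation. The names
master_is_stale, shadow_step, start_master and the 60-second threshold
are all assumptions.

```python
def master_is_stale(last_heartbeat, now, max_age_s=60.0):
    """True if the master's last heartbeat is older than max_age_s."""
    return (now - last_heartbeat) > max_age_s


def shadow_step(last_heartbeat, now, start_master, max_age_s=60.0):
    """One monitoring pass on the standby host.

    `start_master` is a hypothetical callable that launches the master
    daemon locally. Returns True if a takeover was triggered.
    """
    if master_is_stale(last_heartbeat, now, max_age_s):
        start_master()
        return True
    return False
```

In practice the timestamps would come from something both hosts can
see, such as a file on shared storage, and the standby would run
shadow_step in a loop with a sleep between passes.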

I think someone on this list is planning to tell us about his
experience with SGE on his beowulf?

Rayson

P.S. 

links:
OpenPBS public home: http://www-unix.mcs.anl.gov/openpbs/
SGE                : http://gridengine.sunsource.net
Maui               : http://www.supercluster.org



