Uptime data/studies/anecdotes ... ?
Roger L. Smith
roger at ERC.MsState.Edu
Tue Apr 2 10:46:07 PST 2002
On Tue, 2 Apr 2002, Richard Walsh wrote:
> Thanks for the estimate. Do you use SCYLD or another pseudo-single-system-
> image tool?
Nope. We use RH 7.2 and PBS, with MPI/Pro, MPICH, and LAM MPI.
> I assume that 75% is a steady state number ... how long did
> it take your group to reach that state?
Our users are a bit "bursty". The cluster rarely drops below 50%.
Looking back through my records, it hasn't been below 140 processors in
use in several weeks, and has spent most of its time with 400+ in use.
As we near project deadlines, we often have jobs waiting in the queue.
I've seen as many as 1100 processors in use, or requested and waiting.
When we upgraded from 324 to 586 processors, the users were banging on my
door wanting to know when the new nodes were available. Within an hour of
releasing the new nodes (and without any notification to the users), they
were already using over 500 processors. I'm currently working on an
expansion to about 1036 processors, and I fully expect to see it slammed
within a few days of release.
> If a full reboot is required
> only every 3-4 months then is single node failure your main source of
> cycle loss? Or are other things like inefficient scheduling and lack of
> check-point/restart, etc. important?
PBS is our leading cause of cycle loss. We now run a cron job on the
headnode that checks every 15 minutes to see if the PBS daemons have died,
and if so, it automatically restarts them. About 75% of the time that I
have a node fail to accept jobs, it is because its pbs_mom has died, not
because there is anything wrong with the node.
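
For anyone wanting to try the same approach, here is a minimal sketch of
such a watchdog. It is not the script we actually run: the daemon list,
the restart commands, and the install paths are all assumptions, and it
only checks processes on the machine it runs on.

```python
#!/usr/bin/env python3
"""Hypothetical PBS daemon watchdog, meant to be run from cron."""
import subprocess

# Assumed daemon names and restart commands; the paths are placeholders
# and will differ between installations.
DAEMONS = {
    "pbs_server": ["/usr/local/sbin/pbs_server"],
    "pbs_sched": ["/usr/local/sbin/pbs_sched"],
}


def is_running(name, run=subprocess.run):
    """Return True if a process with exactly this name exists.

    pgrep -x exits with status 0 only when at least one process matches.
    """
    result = run(["pgrep", "-x", name],
                 stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def restart_dead(daemons=DAEMONS, run=subprocess.run):
    """Start every listed daemon that is not running.

    Returns the names of the daemons that were restarted, in order.
    """
    restarted = []
    for name, command in daemons.items():
        if not is_running(name, run=run):
            run(command)
            restarted.append(name)
    return restarted
```

To match the 15-minute interval, the script would end with a call to
restart_dead() and be scheduled with a crontab entry along the lines of
"*/15 * * * * /usr/local/sbin/pbs_watchdog.py" (a made-up path). The
run parameter exists only so the check can be exercised with a stand-in
instead of real processes.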
| Roger L. Smith Phone: 662-325-3625 |
| Systems Administrator FAX: 662-325-7692 |
| roger at ERC.MsState.Edu http://WWW.ERC.MsState.Edu/~roger |
| Mississippi State University |
|_______________________Engineering Research Center_______________________|