Uptime data/studies/anecdotes ... ?
Roger L. Smith roger at ERC.MsState.Edu
Tue Apr 2 10:46:07 PST 2002
- Previous message: Uptime data/studies/anecdotes ... ?
- Next message: Lost cycles due to PBS (was Re: Uptime data/studies/anecdotes)
On Tue, 2 Apr 2002, Richard Walsh wrote:

> Thanks for the estimate. Do you use SCYLD or another pseudo-single-system-
> image tool?

Nope. We use RH 7.2, PBS, and MPI/Pro, MPICH, and LAM MPI.

> I assume that 75% is a steady state number ... how long did
> it take your group to reach that state?

Our users are a bit "bursty". The cluster rarely drops below 50%. Looking
back through my records, it hasn't been below 140 processors in use in
several weeks, and has spent most of its time with 400+ in use. As we near
project deadlines, we often have jobs waiting in the queue. I've seen as
many as 1100 processors in use, or requested and waiting.

When we upgraded from 324 to 586 processors, the users were banging on my
door wanting to know when the new nodes were available. Within an hour of
releasing the new nodes (and without any notification to the users), they
were already using over 500 processors. I'm currently working on an
expansion to about 1036 processors, and I fully expect to see it slammed
within a few days of release.

> If a full reboot is required
> only every 3-4 months then is single node failure your main source of
> cycle loss? Or are other things like inefficient scheduling and lack of
> check-point/restart, etc. important?

PBS is our leading cause of cycle loss. We now run a cron job on the
headnode that checks every 15 minutes to see if the PBS daemons have died,
and if so, it automatically restarts them. About 75% of the time that I
have a node fail to accept jobs, it is because its pbs_mom has died, not
because there is anything wrong with the node.

 _\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_
| Roger L. Smith                       Phone: 662-325-3625                |
| Systems Administrator                FAX: 662-325-7692                  |
| roger at ERC.MsState.Edu             http://WWW.ERC.MsState.Edu/~roger  |
|                      Mississippi State University                       |
|_______________________Engineering Research Center_______________________|
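
For anyone wanting to try the cron-based check described in the message, here is a minimal sketch of what such a watchdog might look like. It is not the script used at the ERC; the message does not show it. The daemon list (only pbs_mom is named in the message), the use of pgrep, restarting a daemon by running its binary from PATH, and the script path in the crontab comment are all assumptions. It only checks daemons on the machine it runs on, so a pbs_mom on a remote node would need a copy on that node or some remote check.

```python
#!/usr/bin/env python3
"""Restart local PBS daemons that have died (illustrative sketch only)."""
import subprocess
from typing import Callable, Iterable, List

# Assumed daemon names. Only pbs_mom appears in the message; pbs_server and
# pbs_sched are the usual head-node daemons in OpenPBS-style installs.
DAEMONS = ("pbs_server", "pbs_sched", "pbs_mom")


def is_running(name: str) -> bool:
    """True if a process with exactly this name exists (pgrep -x exits 0)."""
    result = subprocess.run(["pgrep", "-x", name], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def restart(name: str) -> None:
    """Assumption: the daemon binary is on PATH and backgrounds itself.
    Many sites would call their own init script here instead."""
    subprocess.run([name], check=False)


def watchdog(
    daemons: Iterable[str],
    is_running: Callable[[str], bool] = is_running,
    restart: Callable[[str], None] = restart,
) -> List[str]:
    """Restart every daemon that is not running; return the names restarted."""
    restarted = []
    for name in daemons:
        if not is_running(name):
            restart(name)
            restarted.append(name)
    return restarted


if __name__ == "__main__":
    # Example crontab line for a 15-minute interval (script path is hypothetical):
    #   */15 * * * * /usr/local/sbin/pbs_watchdog.py
    for name in watchdog(DAEMONS):
        print(f"restarted {name}")
```

The check and restart steps are passed in as functions so the loop can be exercised without touching real daemons, for example by handing watchdog() a fake is_running that reports pbs_mom as dead.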
More information about the Beowulf mailing list
