Time limits in queues (was: Re: [Beowulf] VMC - Virtual Machine Console)
Craig.Tierney at noaa.gov
Wed Jan 16 09:16:18 PST 2008
..Interesting discussion deleted..
> As a funny aside, I once knew a sysadmin who applied 24-hour time limits
> to all queues of all clusters he managed in order to force researchers
> to think about checkpoints and smart restarts. I couldn't understand
> why so many folks from his particular unit kept asking me about arrays
> inside the scheduler submission scripts and nested commands until I
> found that out. Unfortunately I came to the conclusion that folks in
> his unit were spending more time writing job submission scripts than
> code... well... maybe that is an exaggeration.
Our queue limits are 8 hours. They are set this way for two reasons.
First, we have real-time jobs that need to get through the queues and
we believe that allowing significantly longer jobs would block those
really important jobs. Second, for a multi-user system, it isn't very
fair for a user to run multi-day jobs and prevent shorter jobs from getting
in. It is about being fair. Use the resource and then get back in line.
I know that at other US Government facilities it is common practice to
set sub-day queue limits. I recently helped set up one site that had
queue limits set at 12 hours. Another large organization near the top
of the Top500 list does this as well.
This means that codes need check-pointing. Although we are all waiting
for the holy grail of system-level check-pointing, it is unlikely to be
implemented consistently across architectures AND without a significant
performance hit. This means that researchers have to also be
software engineers. If they want to get real work done, adding check-pointing
is one of the steps. As one operations manager at a major HPC site once said
to me 'codes that don't support check-pointing aren't real codes'.
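As a rough illustration of what application-level check-pointing under a
queue time limit can look like, here is a minimal Python sketch. Everything
in it is an assumption made for the example, not any site's actual setup:
the function names (save_checkpoint, load_checkpoint, run), the JSON state
file, and the toy "work" loop are all hypothetical. The idea is simply that
the job saves its state periodically, stops cleanly once its time budget is
used up, and picks up where it left off when it is resubmitted.

```python
import json
import os
import tempfile
import time


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint,
    so a job killed mid-write does not leave a corrupt checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0.0}


def run(path, n_steps, time_budget_s, checkpoint_every=100):
    """Advance a toy computation from the last checkpoint.

    Returns True when all n_steps are done, or False if the time budget
    ran out first (the job should then be resubmitted)."""
    state = load_checkpoint(path)
    start = time.monotonic()
    while state["step"] < n_steps:
        if time.monotonic() - start >= time_budget_s:
            save_checkpoint(path, state)
            return False
        state["total"] += state["step"] * 0.5  # stand-in for real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return True
```

With an 8 hour queue limit one would presumably pass a budget somewhat
below 8 hours (say run("ckpt.json", n_steps, 7.5 * 3600)) to leave time for
the final write, and have the submission script resubmit the job whenever
run() returns False.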
Allowing users to run for days or weeks as SOP is begging for failure.
Did that sysadmin who set 24-hour time limits ever analyze the amount
of computational time lost because of larger time limits?
Craig Tierney (craig.tierney at noaa.gov)