Time limits in queues (was: Re: [Beowulf] VMC - Virtual Machine Console)
geoff at galitz.org
Sun Jan 20 10:42:03 PST 2008
Interesting. We (and by we, I refer to my time at UC Berkeley College of
Chemistry) used to implement multiple queues with various time
restrictions to accommodate short, medium, long and extended run jobs. It
was an honor system to be sure, but I spent a great amount of time
working with the researchers on an individual level to foster the trust
that an honor system needs. There was also a little logic to allow
submitted jobs to skew towards one end of the spectrum if the cluster was
not fully utilized, and not expected to be so. Working that closely with
folks also allowed us to chart cluster usage for about a month (and
sometimes much more) so we could tweak cluster policy if appropriate.
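
For anyone who wants to picture the multi-queue arrangement, here is a
minimal sketch of the routing idea. The queue names, the time limits and
the 50% utilization threshold are all made up for illustration, and the
"skew" rule (bump a job one tier longer when the cluster is mostly idle)
is only one possible reading of the logic described above, not a record
of what was actually deployed.

```python
from datetime import timedelta

# Hypothetical queue names and limits, ordered from shortest to longest.
# None of these values come from the original post.
QUEUES = [
    ("short", timedelta(hours=1)),
    ("medium", timedelta(hours=8)),
    ("long", timedelta(hours=48)),
    ("extended", timedelta(days=7)),
]


def pick_queue(requested, utilization=1.0, idle_threshold=0.5):
    """Return the name of the queue a job should be routed to.

    requested      -- walltime the user asks for (timedelta)
    utilization    -- fraction of the cluster currently busy, 0.0 to 1.0
    idle_threshold -- assumed cutoff below which the cluster counts as idle
    """
    # Find the smallest queue whose limit covers the requested walltime.
    for index, (_name, limit) in enumerate(QUEUES):
        if requested <= limit:
            break
    else:
        raise ValueError("requested walltime exceeds every queue limit")

    # Assumed skew rule: on a mostly idle cluster, allow one tier longer.
    if utilization < idle_threshold and index + 1 < len(QUEUES):
        index += 1

    return QUEUES[index][0]
```

With these made-up numbers, a two-hour request normally lands in "medium",
and in "long" when the cluster is reported as 20% busy.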
It worked out for the most part, but there was the occasional scofflaw.
With the trust relationship I had with the researchers, we could usually
nag the scofflaws back into line.
Layer 8 issues can certainly lead to trouble, but it can also be used to
Just a personal observation. I realize this kind of thing would not work
> Our queue limits are 8 hours. They are set this way for two reasons.
> First, we have real time jobs that need to get through the queues and
> we believe that allowing significantly longer jobs would block those
> really important jobs. Second, for a multi-user system, it isn't very
> fair for a user to run multi-day jobs and prevent shorter jobs from
> getting in. It is about being fair. Use the resource and then get back in line.
> I know that at other US Government facilities it is common practice to
> set sub-day queue limits. I recently helped set up one site that had
> queue limits set at 12 hours. Another large organization near the top
> of the top 500 list does this as well.
> This means that codes need check-pointing. Although we are all waiting
> for the holy grail of system level check-pointing, the odds of that being
> implemented consistently across architectures AND without a significant
> performance hit are low. This means that researchers have to also be
> software engineers. If they want to get real work done, adding
> check-pointing is one of the steps. As one operations manager at a major
> HPC site once said to me, 'codes that don't support check-pointing aren't
> real codes'.
> Allowing users to run for days or weeks as SOP is begging for failure.
> Did that sysadmin who set 24 hour time limits ever analyze the amount
> of lost computational time because of larger time limits?
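
Since the quoted message leans so heavily on application-level
check-pointing, here is a rough sketch of what that can look like under a
wall-clock limit. Everything in it is illustrative: the function names, the
JSON state file, the toy "work" loop and the time budget parameter are
assumptions, not anything taken from a particular code or scheduler. A real
code would save its own state and leave some margin before the queue limit.

```python
import json
import os
import tempfile
import time


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it over the old file.

    The rename means a job killed mid-write leaves the old checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def run(path, n_steps, time_budget_s, checkpoint_every=100):
    """Advance a toy computation until done or until the budget is spent.

    Returns (state, finished). Resubmitting with the same path resumes
    from the last saved step.
    """
    state = load_checkpoint(path)
    deadline = time.monotonic() + time_budget_s
    while state["step"] < n_steps:
        if time.monotonic() >= deadline:
            # Out of time: save and exit cleanly so the next job can resume.
            save_checkpoint(path, state)
            return state, False
        state["total"] += state["step"]  # stand-in for real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state, True
```

The idea is that each job in the chain calls run() with a budget somewhat
below the queue limit, and the job script resubmits itself until run()
reports that it finished.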
Geoff Galitz, geoff at galitz.org