[Beowulf] Re: Time limits in queues

Lombard, David N dnlombar at ichips.intel.com
Thu Jan 17 08:34:19 PST 2008


On Thu, Jan 17, 2008 at 02:53:36PM +0100, Bogdan Costescu wrote:
> On Wed, 16 Jan 2008, Craig Tierney wrote:
> 
> >Our queue limits are 8 hours.
> >...
> >Did that sysadmin who set 24 hour time limits ever analyze the amount
> >of lost computational time because of larger time limits?
> 
> While I agree with the idea and reasons of short job runtime limits, I 
> disagree with your formulation. Being many times involved in 
> discussions about what runtime limits should be set, I wouldn't make 
> myself a statement like yours; I would say instead: YMMV. In other 
> words: choose what fits better the job mix that users are actually 
> running. If you have determined that 8h max. runtime is appropriate 
> for _your_ cluster and increasing it to 24h would lead to a waste of 
> computational time due to the reliability of _your_ cluster, then 
> you've done your job well. But saying that everybody should use this 
> limit is wrong.

Completely agree.

> Furthermore, although you mention that system-level checkpointing is 
> associated with a performance hit, you seem to think that user-level 
> checkpointing is a lot lighter, which is most often not the case. 

Hmmm. A system level checkpoint must save the complete state of the
process to be checkpointed plus all of its siblings/children plus varying
amounts of external state; a machine level checkpoint must save complete
machine(s) state.  A user level checkpoint need only save the data that
define the current state--that could well be a small set of values.
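
To illustrate the user-level case, here is a minimal sketch in Python, assuming
an iterative job whose whole restart state is a step counter and one running
value. The names save_checkpoint, load_checkpoint, run, and stop_after are
hypothetical, and the per-step arithmetic is a stand-in for real work.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the small restart state as JSON, replacing the old file atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace renames over the old file, so a crash never leaves a torn checkpoint.
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh initial state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0.0}


def run(path, n_steps, checkpoint_every=100, stop_after=None):
    """Advance the job; stop_after (hypothetical) mimics hitting a queue time limit."""
    state = load_checkpoint(path)
    done_this_run = 0
    while state["step"] < n_steps:
        if stop_after is not None and done_this_run >= stop_after:
            break
        state["total"] += state["step"] * 0.5  # placeholder for the real computation
        state["step"] += 1
        done_this_run += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)  # final save on normal exit or at the limit
    return state
```

Calling run() again with the same path resumes from the last saved step. The
checkpoint here is two numbers, where a system-level checkpoint would have to
capture the whole process image.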

Having written that, it may be *easier* (even cheaper) to expend the
resources to save the complete state than to restructure some suitably
complex code to expose a restart state.  I certainly know an application
that fits that model during most of its runtime. But, at the end of
the day, that is just trading runtime for design/coding/validation
time and the notion's validity depends on which side of the operation
you sit.  Consider this, though: if as an admin you rely only on user-
level checkpoints, you *will* end up with an argument from one or more
users about the maximum runtime at some point; with a system (or machine)
checkpoint, you'll likely avoid a lot of agida[1], especially when
unplanned or emergency outages/reprioritizations occur.

> Apart from the obvious I/O limitations that could restrict saving & 
> loading of checkpointing data, there are applications for which 
> developers have chosen to not store certain data but recompute it 
> every time it is needed because the effort of saving, storing & 
> loading it is higher than the computational effort of recreating it - 
> but this most likely means that for each restart of the application 
> this data has to be recomputed. And smaller max. runtimes mean more 
> restarts needed to reach the same total runtime...

As you note, only the application can know that it's easier to recompute
than save and restore.  I suspect many of us can cite specific examples
where it's easier to recompute; some could probably also cite cases
where recomputing is faster too...
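
The restart arithmetic in the quoted paragraph can be sketched with a simple
cost model. This is an illustration only, assuming every job segment pays a
fixed recompute cost at startup and a fixed checkpoint cost before the limit;
the function names and all parameters are hypothetical.

```python
import math


def segments_needed(work_hours, limit_hours, startup_hours, checkpoint_hours=0.0):
    """Number of queue submissions needed to finish work_hours of useful compute."""
    useful = limit_hours - startup_hours - checkpoint_hours
    if useful <= 0:
        raise ValueError("runtime limit leaves no time for useful work")
    return math.ceil(work_hours / useful)


def total_machine_hours(work_hours, limit_hours, startup_hours, checkpoint_hours=0.0):
    """Useful work plus the per-segment overhead of recomputing and checkpointing.

    Simplification: the last segment is charged a checkpoint it may not need.
    """
    n = segments_needed(work_hours, limit_hours, startup_hours, checkpoint_hours)
    return work_hours + n * (startup_hours + checkpoint_hours)
```

For example, under this model a job with 96 hours of useful work and half an
hour of recomputation per restart needs 13 segments under an 8 hour limit but
only 5 under a 24 hour limit. Whether that overhead outweighs the time lost to
failures in longer jobs depends on the cluster.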

[1] Heartburn, indigestion, general upset or agitation.

-- 
David N. Lombard, Intel, Irvine, CA
I do not speak for Intel Corporation; all comments are strictly my own.


