[Beowulf] backtraces

Craig Tierney ctierney at hypermall.net
Tue Jun 12 08:34:22 PDT 2007


Gerry Creager wrote:
> I've tried to stay out of this.  Really, I have.
> 
> Craig Tierney wrote:
>> Mark Hahn wrote:
>>>> Sorry to start a flame war....
>>>
>>> what part do you think was inflamed?
>>
>> It was when I was trying to say "Real codes have user-level
>> checkpointing implemented and no code should ever run for 7
>> days."
> 
> A number of my climate simulations will run for 7-10 days to get 
> century-long simulations to complete.  I've run geodesy simulations that 
> ran for up to 17 days in the past.  I like to think that my codes are 
> real enough!
> 

NCAR and GFDL run climate simulations for weeks as well.  What is the
longest period of time any one job can run?  It is 8-12 hours.  I can verify
these numbers if needed, but I can guarantee you that no one is allowed
to put their job in for 17 days.  With explicit permission they may get
24 hours, but that would be for unique situations.

> Real codes do have user-level checkpointing, though.  And even better 
> codes can be restarted without a lot of user intervention by invoking a 
> run-time flag and going off for coffee.
> 

You mean there are people who bother to implement checkpointing and
then don't write the code like:

if (checkpoint files exist in my directory) then
    load checkpoint files
else
    start from scratch
end

????
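
The restart logic above can be sketched in runnable form.  This is only a
minimal illustration, not code from any of the models discussed here: the
checkpoint file name, the JSON state layout, and the max_steps_this_job
parameter (a stand-in for a queue's wall-clock limit) are all assumptions
made for the example.

```python
import json
import os
import tempfile

CHECKPOINT = "checkpoint.json"  # hypothetical file name


def load_or_init(path):
    """Resume from a checkpoint if one exists, otherwise start from scratch."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "value": 0.0}


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a crash mid-write
    cannot leave a truncated checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(path, total_steps, checkpoint_every=10, max_steps_this_job=None):
    """Advance the (toy) simulation, stopping early if this job's budget runs out."""
    state = load_or_init(path)
    done = 0
    while state["step"] < total_steps:
        if max_steps_this_job is not None and done >= max_steps_this_job:
            break  # out of budget; the next job picks up from the checkpoint
        state["value"] += 1.0  # stand-in for one step of real work
        state["step"] += 1
        done += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)  # always leave a current checkpoint on exit
    return state
```

Resubmitting the same job repeatedly then needs no user intervention: each
run calls run(CHECKPOINT, total_steps) and continues from wherever the
previous one stopped.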


>> Set your queue maximums to 6-8 hours.  Prevents system hogging,
>> encourages checkpointing for long runs.  Make sure your IO system
>> can support the checkpointing because it can create a lot of load.
> 
> And how do you support my operational requirements with this policy 
> during hurricane season?  Let's see... "Stop that ensemble run now so 
> the Monte Carlo chemists can play for a while, then we'll let you back 
> on.  Don't worry about the timeliness of your simulations.  No one needs 
> a 35-member ensemble for statistical forecasting, anyway."  Did I miss 
> something?
> 

You kick off the users that are not running operational codes because
their work is (probably) not as time-constrained.  Also, if it takes
so long to get your answer in an operational mode that the answer
doesn't matter anymore, you need a faster computer.  If you cannot
spit out a 12-hour hurricane forecast in a couple of hours, I would be
concerned about how valuable the answer would be.

Craig

> Yeah, we really do that.  With boundary-condition munging we can run a 
> statistical set of simulations and see what the probabilities are and 
> where, for instance, maximum storm surge is likely to go.  If we don't 
> get sufficient membership in the ensemble, the statistical strength of 
> the forecasting procedure decreases.
> 
> Gerry
> 
>>> part of the reason I got a kick out of this simple backtrace.so
>>> is indeed that it's quite possible to conceive of a checkpoint.so
>>> which uses /proc/$pid/fd and /proc/$pid/maps to do a possibly decent 
>>> job of checkpointing at least serial codes non-intrusively.
>>>
>>
>> BTW, I like your code.  I had a script written for me in the past
>> (by Greg Lindahl in a galaxy far-far away).  The one modification
>> I would make is to print out the MPI ID environment variable (MPI
>> flavors vary in how it is set).  Then when it crashes, you know which
>> process actually died.
>>
>> Craig
>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org
>> To change your subscription (digest mode or unsubscribe) visit 
>> http://www.beowulf.org/mailman/listinfo/beowulf
> 
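
The suggestion quoted above, reporting the MPI ID from the environment so
you know which process died, might look roughly like the sketch below.
The variable names are ones commonly set by Open MPI, MPICH-style
launchers, MVAPICH2, and Slurm; which of them (if any) is present depends
on the MPI flavor and launcher, so treat the list as an assumption to
adjust for your site.  The function names are hypothetical, and this is
separate from the backtrace.so discussed in the thread.

```python
import os
import sys

# Candidate rank variables, checked in order.  Assumed names; extend as needed.
RANK_VARS = (
    "OMPI_COMM_WORLD_RANK",  # Open MPI
    "PMI_RANK",              # MPICH-style process managers
    "MV2_COMM_WORLD_RANK",   # MVAPICH2
    "SLURM_PROCID",          # Slurm's srun
)


def mpi_rank(environ=None):
    """Return (variable_name, rank_string), or (None, "unknown") if none is set."""
    env = os.environ if environ is None else environ
    for name in RANK_VARS:
        value = env.get(name)
        if value is not None:
            return name, value
    return None, "unknown"


def report_rank(stream=sys.stderr, environ=None):
    """Print the pid and MPI rank, e.g. from a crash handler."""
    name, rank = mpi_rank(environ)
    stream.write(f"pid {os.getpid()} rank {rank} (from {name})\n")
```

Calling report_rank() from whatever handler prints the backtrace tags the
output with the rank, so a crash in a large job can be traced to one
process.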




