[Beowulf] backtraces

Mon Jun 11 20:54:28 PDT 2007

Mark Hahn wrote:
>> Sorry to start a flame war....
> 
> what part do you think was inflamed?

It was when I was trying to say "Real codes have user-level
checkpointing implemented and no code should ever run for 7
days."

> 
>> Make sure that your code generates the exact same answer with 
>> debug/backtrace enabled and disabled,
> 
> part of the point of my very simple backtrace.so is that it has zero 
> runtime overhead and doesn't require any special compilation.
> 

Does the Intel version have overhead?  I have never measured it,
but I never thought it was much.

>> then you add user-level checkpointing so that you can
> 
> I'm most curious to hear people's experience with checkpointing.
> all our more serious, established codes do checkpointing, but it's 
> extremely foreign to people writing newish codes.
> and, of course, it's a lot of extra work.  I'm not arguing against
> checkpointing, just acknowledging that although we _require_ it,
> we don't actually demand "proof-of-checkpointability".
> 

I added checkpointing to an ocean model once.  It was very easy,
but that was most likely because of how the code was organized
(Fortran 77, with most data structures shared).
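
For anyone who has not done this before, here is a minimal sketch of
user-level checkpointing.  It is written in Python rather than Fortran,
and it is not the ocean model's code.  The names save_checkpoint,
load_checkpoint and run are made up for illustration, and the "model" is
a toy accumulator.  The idea is only that all restartable state lives in
one place, gets written out atomically every so often, and is reloaded at
startup if a checkpoint exists.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to path atomically so a crash mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # rename over the old checkpoint in one step


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, checkpoint_every, stop_after=None):
    """Advance a toy model, resuming from the last checkpoint if present.

    stop_after (hypothetical) simulates a crash or queue limit after that
    many steps taken in this invocation.
    """
    state = load_checkpoint(path) or {"step": 0, "value": 0.0}
    done_here = 0
    while state["step"] < total_steps:
        if stop_after is not None and done_here >= stop_after:
            break
        state["value"] += 0.5 * state["step"]  # stand-in for the real time step
        state["step"] += 1
        done_here += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state
```

A run that is killed and restarted should end with exactly the same
state as an uninterrupted run, which is worth testing before trusting
the checkpoints.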

I don't think that it is foreign to people writing new codes.
It is foreign to scientists.  Software developers (who could be
scientists) would think of this from the beginning (I hope).

>> restart where you want.  Then you
>> run up until the problem and restart with the last checkpoint.
> 
> restarting from checkpoint is fine (the code in question could
> actually do it), but still means you have hours of running,
> presumably under a debugger.
> 
>> Run for a week without checkpointing?  Just begging for trouble.
> 
> suppose you have 2k users, with ~300 active at any instant,
> and probably 200 unrelated codes running.  while we do require
> checkpointing (I usually say "every 6-8 cpu hours"), I suspect that many 
> users never do.  how do you check/validate/encourage/support
> checkpointing?
> 

Set your queue maximums to 6-8 hours.  That prevents system hogging
and encourages checkpointing for long runs.  Make sure your I/O system
can handle the checkpoint traffic, because it can create a lot of load.
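
On the user side, a code that lives under a queue limit has to stop and
save state before the scheduler kills it.  A rough sketch in Python,
assuming the job is told its own limit (limit_s) and picks a safety
margin (margin_s) long enough to write a checkpoint.  The function name
and both parameters are hypothetical, and how the limit reaches the job
depends on your batch system.

```python
import time


def run_within_limit(step_fn, state, limit_s, margin_s, clock=time.monotonic):
    """Call step_fn(state) until the job is within margin_s of its limit.

    step_fn returns the next state, or None when the whole run is done.
    Returns (state, finished) so the caller can checkpoint and exit.
    """
    start = clock()
    while clock() - start < limit_s - margin_s:
        new_state = step_fn(state)
        if new_state is None:
            return state, True
        state = new_state
    return state, False
```

If finished comes back False, the caller writes a checkpoint and
resubmits the job.  The margin has to cover the slowest step plus the
checkpoint write, which is where the I/O load matters.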

> part of the reason I got a kick out of this simple backtrace.so
> is indeed that it's quite possible to conceive of a checkpoint.so
> which uses /proc/$pid/fd and /proc/$pid/maps to do a possibly decent job 
> of checkpointing at least serial codes non-intrusively.
> 

BTW, I like your code.  I had a script written for me in the past
(by Greg Lindahl in a galaxy far, far away).  The one modification
I would make is to print out the MPI ID environment variable (MPI
flavors vary in how it is set).  Then when it crashes, you know which
process actually died.
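
Roughly what I mean, sketched in Python rather than as a patch to
backtrace.so.  The variable names in RANK_VARS are the ones I know some
current launchers set (Open MPI, PMI-based MPICH derivatives, MVAPICH2,
Slurm's srun).  Treat the list as an assumption and adjust it for your
MPI.  find_rank and crash_label are made-up names.

```python
import os
import socket

# Rank variables set by some common launchers; adjust for your site.
RANK_VARS = (
    "OMPI_COMM_WORLD_RANK",  # Open MPI
    "PMI_RANK",              # MPICH and derivatives that use PMI
    "MV2_COMM_WORLD_RANK",   # MVAPICH2
    "SLURM_PROCID",          # Slurm's srun
)


def find_rank(environ=None):
    """Return (variable_name, value) for the first rank variable set, else None."""
    env = os.environ if environ is None else environ
    for name in RANK_VARS:
        if name in env:
            return name, env[name]
    return None


def crash_label(environ=None, host=None):
    """Build a one-line prefix identifying the process that died."""
    host = socket.gethostname() if host is None else host
    found = find_rank(environ)
    rank = "rank unknown" if found is None else f"{found[0]}={found[1]}"
    return f"[{host} pid {os.getpid()} {rank}]"
```

Printing that label ahead of the backtrace tells you the host, the pid
and the rank in one line.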

Craig