ctierney at hypermall.net
Mon Jun 11 20:54:28 PDT 2007
Mark Hahn wrote:
>> Sorry to start a flame war....
> what part do you think was inflamed?
It was when I was trying to say "Real codes have user-level
checkpointing implemented and no code should ever run for 7
>> Make sure that your code generates the exact same answer with
>> debug/backtrace enabled and disabled,
> part of the point of my very simple backtrace.so is that it has zero
> runtime overhead and doesn't require any special compilation.
Does the Intel version have overhead? I never measured it before,
but I never thought it was much.
>> then you add user-level checkpointing so that you can
> I'm most curious to hear people's experience with checkpointing.
> all our more serious, established codes do checkpointing, but it's
> extremely foreign to people writing newish codes.
> and, of course, it's a lot of extra work. I'm not arguing against
> checkpointing, just acknowledging that although we _require_ it,
> we don't actually demand "proof-of-checkpointability".
I added checkpointing to an ocean model once. It was very easy,
but that was most likely because of how the code was organized
(Fortran 77, most data structures were shared).
I don't think that it is foreign to people writing new codes.
It is foreign to scientists. Software developers (who could be
scientists) would think of this from the beginning (I hope).
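
For readers who have not written one, here is a minimal sketch of user-level checkpoint/restart in Python. It is not taken from any code discussed in this thread. The names `save_checkpoint`, `load_checkpoint`, and `run`, the pickle format, and the toy "time step" are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # A crash before this line leaves the previous checkpoint intact.
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise


def load_checkpoint(path, default):
    """Return the saved state, or the default if no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, checkpoint_every):
    """Toy time-stepping loop that resumes from the last checkpoint."""
    state = load_checkpoint(path, {"step": 0, "value": 0.0})
    while state["step"] < total_steps:
        state["value"] += 0.5 * state["step"]  # stand-in for a real time step
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

A quick sanity check is to run once straight through and once with a stop and restart in the middle, then compare the two results. Both should give the same answer.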
>> restart where you want. Then you
>> run up until the problem and restart with the last checkpoint.
> restarting from checkpoint is fine (the code in question could
> actually do it), but still means you have hours of running,
> presumably under a debugger.
>> Run for a week without checkpointing? Just begging for trouble.
> suppose you have 2k users, with ~300 active at any instant,
> and probably 200 unrelated codes running. while we do require
> checkpointing (I usually say "every 6-8 cpu hours"), I suspect that many
> users never do. how do you check/validate/encourage/support
Set your queue maximums to 6-8 hours. That prevents system hogging
and encourages checkpointing for long runs. Make sure your I/O system
can support the checkpointing, because it can create a lot of load.
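
With a hard queue limit, the code needs to decide when to write a checkpoint. A rough sketch of one way to do that follows. The function name `should_checkpoint`, its parameters, and the idea of a safety margin before the wall-clock limit are assumptions for illustration, not a feature of any particular batch system.

```python
def should_checkpoint(now, last_checkpoint, job_start,
                      interval_s, walltime_s, margin_s):
    """Return True if a checkpoint is due.

    All times are in seconds. A checkpoint is due when the regular
    interval has elapsed, or when the job is within margin_s of the
    queue's wall-clock limit (walltime_s) and is about to be killed.
    """
    interval_elapsed = (now - last_checkpoint) >= interval_s
    near_limit = (now - job_start) >= (walltime_s - margin_s)
    return interval_elapsed or near_limit
```

For example, with an 8-hour limit, an hourly interval, and a 10-minute margin, the main loop would call this once per time step with the current time and write a checkpoint whenever it returns True. The margin has to be longer than the time the I/O system needs to absorb the checkpoint.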
> part of the reason I got a kick out of this simple backtrace.so
> is indeed that it's quite possible to conceive of a checkpoint.so
> which uses /proc/$pid/fd and /proc/$pid/maps to do a possibly decent job
> of checkpointing at least serial codes non-intrusively.
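
To make the /proc idea concrete, here is a small sketch of the inventory step only: listing a process's open file descriptors and its writable memory regions on Linux. It is not a checkpointer, and it is not the checkpoint.so imagined above. The function names are made up for illustration.

```python
import os


def open_files(pid):
    """Map fd number -> target, read from the symlinks in /proc/<pid>/fd."""
    fd_dir = f"/proc/{pid}/fd"
    result = {}
    for name in os.listdir(fd_dir):
        try:
            result[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The fd was closed between listdir() and readlink().
            continue
    return result


def parse_maps_line(line):
    """Parse one line of /proc/<pid>/maps.

    Format: start-end perms offset dev inode [pathname]
    """
    fields = line.split(None, 5)
    start, end = (int(x, 16) for x in fields[0].split("-"))
    return {
        "start": start,
        "end": end,
        "perms": fields[1],
        "offset": int(fields[2], 16),
        "path": fields[5].strip() if len(fields) > 5 else "",
    }


def writable_regions(pid):
    """Return the memory regions a checkpointer would have to save."""
    with open(f"/proc/{pid}/maps") as f:
        regions = [parse_maps_line(line) for line in f if line.strip()]
    return [r for r in regions if "w" in r["perms"]]
```

Calling `open_files(os.getpid())` and `writable_regions(os.getpid())` on the current process shows what there is to save. Actually saving and restoring it (file offsets, sockets, threads, and so on) is the hard part and is not attempted here.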
BTW, I like your code. I had a script written for me in the past
(by Greg Lindahl in a galaxy far, far away). The one modification
I would make is to print out the MPI ID environment variable (MPI
flavors vary in how they set it). Then when it crashes, you know which
process actually died.
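
A sketch of that lookup in Python is below. The variable names in `RANK_VARS` are ones I believe current launchers set (Open MPI, MPICH-style PMI, PMIx, MVAPICH2, Slurm). Treat the list as an assumption and check what your own MPI flavor exports. The function name `mpi_rank_label` is hypothetical.

```python
import os

# Assumed rank variables, tried in order; adjust for your MPI flavor.
RANK_VARS = (
    "OMPI_COMM_WORLD_RANK",  # Open MPI
    "PMI_RANK",              # MPICH-style PMI launchers
    "PMIX_RANK",             # PMIx-based launchers
    "MV2_COMM_WORLD_RANK",   # MVAPICH2
    "SLURM_PROCID",          # Slurm srun
)


def mpi_rank_label(environ=None):
    """Return 'NAME=value' for the first rank variable that is set."""
    env = os.environ if environ is None else environ
    for name in RANK_VARS:
        value = env.get(name)
        if value is not None:
            return f"{name}={value}"
    return "rank unknown (no known MPI rank variable set)"
```

Printing `mpi_rank_label()` alongside the hostname and PID at the top of the backtrace would be enough to tell which process died.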