[Beowulf] backtraces

Mark Hahn hahn at mcmaster.ca
Mon Jun 11 19:00:02 PDT 2007


> Sorry to start a flame war....

what part do you think was inflamed?

> Make sure that your code generates the exact same answer with debug/backtrace 
> enabled and disabled,

part of the point of my very simple backtrace.so is that it has zero 
runtime overhead and doesn't require any special compilation.

> then you add user-level checkpointing so that you can

I'm most curious to hear people's experience with checkpointing.
all our more serious, established codes do checkpointing, 
but it's extremely foreign to people writing newish codes.
and, of course, it's a lot of extra work.  I'm not arguing against
checkpointing, just acknowledging that although we _require_ it,
we don't actually demand "proof-of-checkpointability".

> restart where you want.  Then you
> run up until the problem and restart with the last checkpoint.

restarting from checkpoint is fine (the code in question could
actually do it), but it still means hours of running,
presumably under a debugger.

> Run for a week without checkpointing?  Just begging for trouble.

suppose you have 2k users, with ~300 active at any instant,
and probably 200 unrelated codes running.  while we do require
checkpointing (I usually say "every 6-8 cpu hours"), I suspect 
that many users never do.  how do you check/validate/encourage/support
checkpointing?

part of the reason I got a kick out of this simple backtrace.so
is indeed that it's quite possible to conceive of a checkpoint.so
which uses /proc/$pid/fd and /proc/$pid/maps to do a possibly 
decent job of checkpointing at least serial codes non-intrusively.
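
to make the idea concrete, here is a rough sketch of the bookkeeping such a
tool would start from: reading /proc/$pid/maps and /proc/$pid/fd for a
process. it is not the checkpoint.so imagined above (that would be a
preloaded C library, and none exists here); the function names read_maps
and read_fds are made up for illustration, and python is used only to keep
it short. it assumes a Linux /proc filesystem.

```python
import os

def read_maps(pid="self"):
    """Parse /proc/<pid>/maps into a list of dicts, one per mapped region."""
    regions = []
    with open(f"/proc/{pid}/maps") as f:
        for line in f:
            # fields: address-range perms offset dev inode [pathname]
            fields = line.split(None, 5)
            start, end = (int(x, 16) for x in fields[0].split("-"))
            regions.append({
                "start": start,
                "end": end,
                "perms": fields[1],
                "offset": int(fields[2], 16),
                "path": fields[5].strip() if len(fields) > 5 else "",
            })
    return regions

def read_fds(pid="self"):
    """Map each open descriptor number to whatever /proc/<pid>/fd says it points at."""
    fds = {}
    fd_dir = f"/proc/{pid}/fd"
    for name in os.listdir(fd_dir):
        try:
            fds[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # the descriptor was closed between listdir() and readlink()
            continue
    return fds

if __name__ == "__main__":
    regions = read_maps()
    # writable regions are the ones a checkpoint would actually have to save
    writable = [r for r in regions if "w" in r["perms"]]
    print(len(regions), "regions,", len(writable), "writable")
    for fd, target in sorted(read_fds().items()):
        print(fd, "->", target)
```

this only enumerates what would need saving. an actual checkpoint would also
have to copy the contents of the writable regions, record file offsets, and
restore registers and signal state, none of which is attempted here.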

regards, mark hahn.


