[Beowulf] Checkpointing MPI applications

Christopher Samuel chris at csamuel.org
Thu Mar 23 19:46:02 UTC 2023


On 2/19/23 10:26 am, Scott Atchley wrote:

> Hi Chris,

Hi Scott!

> It looks like it tries to checkpoint application state without 
> checkpointing the application or its libraries (including MPI). I am 
> curious if the checkpoint sizes are similar to or significantly larger than 
> the application's typical outputs/checkpoints. If they are much larger, 
> the time to write will be higher and they will stress capacity more.

Hmm, I'm not sure (my involvement is relatively peripheral) but I think 
we want to see this used with apps that have no existing C/R mechanism. 
If you ping me directly I can point you to people who will know more 
than I on this.

> We are looking at SCR for Frontier with the idea that users can store 
> checkpoints on the node-local drives with replication to a buddy node. 
> SCR will manage migrating non-defensive checkpoints to Lustre.

Interesting! Does it really need local storage, or can it be used with 
diskless systems via tricks with loopback filesystems, etc.?

All the best,
Chris
-- 
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA


