[Beowulf] Application independent checkpoint/resume?
janne.blomqvist at aalto.fi
Tue Mar 5 23:04:46 PST 2019
At a conference last year I asked one of the CRIU developers about IB RDMA support for live migration. He said CRIU doesn't support it, and there are no plans to add it either.
In a way that makes sense, considering that what CRIU does is basically dump the process memory, plus some support for kernel-managed objects like file descriptors. With RDMA you're basically mapping the NIC HW buffers into your process and spraying away, so how could that be checkpointed (at that level)?
I'd guess it could in theory be possible to let CRIU handle the rest and have the MPI library take care of fixing up the RDMA state, though I'm not aware of any effort in this direction.
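To make that idea a bit more concrete, here is a rough Python sketch of the ordering it implies: release the RDMA resources, let CRIU dump the process, then set the RDMA resources up again. The two hooks (release_rdma, reacquire_rdma) are purely hypothetical stand-ins for whatever an MPI library would have to provide; nothing like them exists as far as I know. Only the "criu dump" command line (-t, -D, --leave-running) is real CRIU usage, and whether a dump would actually succeed on a process that has used RDMA is an open question.

```python
import subprocess
from typing import Callable, List


def criu_dump_cmd(pid: int, images_dir: str) -> List[str]:
    """Build a 'criu dump' command line for the process tree rooted at pid."""
    # -t selects the process tree, -D the directory for the image files;
    # --leave-running keeps the process alive after the dump.
    return ["criu", "dump", "-t", str(pid), "-D", images_dir, "--leave-running"]


def checkpoint_with_rdma_fixup(
    pid: int,
    images_dir: str,
    release_rdma: Callable[[], None],    # hypothetical MPI-library hook
    reacquire_rdma: Callable[[], None],  # hypothetical MPI-library hook
    run=subprocess.run,
) -> None:
    """Checkpoint pid with CRIU, bracketed by the (assumed) RDMA hooks."""
    # Assumption: after this call the process holds no NIC mappings,
    # so only ordinary memory and kernel objects remain to be dumped.
    release_rdma()
    try:
        run(criu_dump_cmd(pid, images_dir), check=True)
    finally:
        # Re-establish RDMA state whether or not the dump succeeded.
        reacquire_rdma()
```

The runner is passed in as a parameter so the ordering can be exercised without CRIU installed; in real use the default subprocess.run would be used, and criu typically needs root.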
In addition to the things you listed, there's BLCR, though I have no experience with it.
(My interest in this topic, which is entirely theoretical, is not checkpoint/restart per se, but rather using live migration to reduce job fragmentation, optimize CPU/memory layout, etc. But again, I'm not aware of any effort in this direction.)
From: Beowulf <beowulf-bounces at beowulf.org> on behalf of Christopher Samuel <chris at csamuel.org>
Sent: Monday, March 4, 2019 9:41:59 PM
To: Beowulf Mailing List
Subject: [Beowulf] Application independent checkpoint/resume?
Just wondering if folks here have recent experiences with
application independent checkpoint/resume mechanisms like DMTCP or CRIU?
Especially interested for MPI uses, and extra bonus points for
experiences on Cray. :-)
From what I can see CRIU doesn't seem to support MPI at all, and DMTCP
only supports it over TCP/IP or (with a supplied plugin) InfiniBand. Are
those inferences true?
Any others I've missed?
All the best,
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA