[Beowulf] HPC fault tolerance using virtualization

Mon Jun 15 11:47:40 PDT 2009

On Mon, Jun 15, 2009 at 1:59 PM, John Hearns<hearnsj at googlemail.com> wrote:
> Proactive Fault Tolerance for HPC using Xen virtualization
>
> It's something I've wanted to see working - doing a Xen live migration
> of a 'dodgy' compute node, and the job just keeps on trucking.
> Looks as if these guys have it working. Anyone else seen similar?

I haven't seen it in the field yet, but I had hoped to do something
similar with a cluster this summer.  I hadn't seen the above paper
before, but I was basing my test on some papers I'd seen about using
Xen with cloud computing initiatives (a la AWS or Eucalyptus).

Ideally I'd like to see InfiniBand worked into the mix, so that I
could use high-speed messaging within the Xen images and then live
migrate an image as the need arises.  DK Panda has a paper that shows a
little bit of this, but details are few and far between.

It would be nice to be able to just move bad hardware out from under a
running job without affecting the run of the job.  If it took an extra
ten minutes for the job to run because of a migration, I think that's a
small price to pay for actually having the run go to completion and
not having to worry so much about checkpoints.

Of course, having said all that, if you've been watching the linux-kernel
mailing list you've probably noticed the Xen/KVM/Linux HV argument
that took place last week.  Makes me a little afraid to push any Linux
HV solution into production, but it's a fun experiment nonetheless...