[Beowulf] HPC fault tolerance using virtualization

Tue Jun 16 07:21:53 PDT 2009

2009/6/16 Ashley Pittman <ashley at pittman.co.uk>

> >
> > elements (or slots) allocated for the job on the node - if the VM is
> > able to adapt itself to such a situation, f.e. by starting several MPI
> > ranks and using shared memory for MPI communication. Further, to
> > cleanly stop the job, the queueing system will have to stop the VMs,
> > sending first a "shutdown" and then a "destroy" command, similar to
> > sending SIGTERM and SIGKILL today.
>

I will offer a counter-example here. I think a lot of people have thought
about rebooting nodes every time a job finishes. There are codes out there
which leave processes running, or leave shared memory segments behind, if
they are not terminated properly. I think everyone has had to run
clean-ipcs at some time!

Yes, you're right, the codes should be written properly and should not do
this. However, it is very tempting to add a reboot as a step after every
job, so that the next job gets a machine in a known state. Running virtual
machines will make that easy, depending on how long they take to boot up.
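
As a rough illustration of the SIGTERM-then-SIGKILL escalation mentioned in
the quoted text, here is a minimal Python sketch of how a job epilogue might
stop a leftover process. The name stop_process and the grace_seconds
parameter are made up for this example and do not come from any queueing
system; a VM "shutdown" followed by a "destroy" would follow the same
pattern.

```python
import subprocess


def stop_process(proc, grace_seconds=5.0):
    """Stop a subprocess.Popen child politely, then forcibly if needed.

    Returns "exited" if the process had already finished, "terminated" if
    it went away after SIGTERM, and "killed" if SIGKILL was required.
    """
    if proc.poll() is not None:
        return "exited"

    proc.terminate()  # sends SIGTERM on POSIX systems
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX systems
        proc.wait()  # reap the child so no zombie is left behind
        return "killed"
```

This only covers processes the epilogue started itself or holds a handle to.
It does nothing about leftover shared memory segments, which is part of why
a reboot (or a fresh VM) after each job is tempting.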

I agree with you about the 5% figure - the point I was making is that there
will come a point where the advantages of running a virtual machine will
outweigh a few percent of performance loss. Who knows where that point will
be!