[Beowulf] HPC fault tolerance using virtualization
hearnsj at googlemail.com
Tue Jun 16 07:21:53 PDT 2009
2009/6/16 Ashley Pittman <ashley at pittman.co.uk>
> > elements (or slots) allocated for the job on the node - if the VM is
> > able to adapt itself to such a situation, e.g. by starting several MPI
> > ranks and using shared memory for MPI communication. Further, to
> > cleanly stop the job, the queueing system will have to stop the VMs,
> > sending first a "shutdown" and then a "destroy" command, similar to
> > sending SIGTERM and SIGKILL today.
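
For reference, the SIGTERM-then-SIGKILL escalation that the quoted text
compares "shutdown" and "destroy" to can be sketched for an ordinary process
as below. This is only a minimal illustration, not anything a particular
queueing system does: the function name stop_gracefully and the 10 second
grace period are assumptions made for the example, and a VM-based version
would have to substitute the hypervisor's own shutdown and destroy calls.

```python
import subprocess


def stop_gracefully(proc: subprocess.Popen, grace_seconds: float = 10.0) -> int:
    """Ask a process to exit, then force it if it does not comply in time.

    Hypothetical helper: the name and the default grace period are
    illustrative assumptions, not taken from any batch system.
    """
    # On POSIX, terminate() sends SIGTERM: the polite "shutdown" request.
    proc.terminate()
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        # On POSIX, kill() sends SIGKILL: the forceful "destroy".
        proc.kill()
        return proc.wait()
```

On POSIX the return value is the negative signal number when the process was
ended by a signal, so the caller can tell whether the polite request was
enough or the forceful step was needed.
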
I will provide a counter-example here - I think that a lot of people have
thought about rebooting nodes every time they finish a job. There are codes
out there which leave processes running, or leave shared memory segments
behind, if they are not properly terminated. I think everyone has had to run
clean-ipcs at some time!
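
As a rough sketch of what such a cleanup step looks for, the snippet below
picks out shared memory segments that belong to a given user and have no
processes attached. It assumes the column layout printed by the util-linux
"ipcs -m" command (key, shmid, owner, perms, bytes, nattch, status); the
function name orphaned_shm_ids is made up for the example, and this is not
the actual clean-ipcs script.

```python
from typing import List


def orphaned_shm_ids(ipcs_output: str, owner: str) -> List[int]:
    """Return shmids owned by `owner` that have no attached processes.

    Assumes the util-linux `ipcs -m` columns:
    key shmid owner perms bytes nattch [status]
    """
    ids = []
    for line in ipcs_output.splitlines():
        fields = line.split()
        # Data rows start with a hex key; headers and blank lines are skipped.
        if len(fields) < 6 or not fields[0].startswith("0x"):
            continue
        shmid, seg_owner, nattch = fields[1], fields[2], fields[5]
        if seg_owner == owner and nattch == "0":
            ids.append(int(shmid))
    return ids
```

Each id returned could then be handed to "ipcrm -m <shmid>" in a job
epilogue. Segments that still have processes attached are deliberately left
alone here, since those processes would need to be dealt with first.
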
Yes, you're right, the codes should be written properly and should not do
However, it is very tempting to put a reboot in as a step following every
job, which means you get a machine in a known state for the next job.
Running virtual machines will make that easy (depends how long they take to
I agree with you about the 5% figure - the point I was making is that there
will come a point where the advantages of running a virtual machine will
outweigh a few percent of performance loss. Who knows where that point will