[Beowulf] HPC fault tolerance using virtualization


John Hearns hearnsj at googlemail.com
Tue Jun 16 07:21:53 PDT 2009


2009/6/16 Ashley Pittman <ashley at pittman.co.uk>

> >
> > elements (or slots) allocated for the job on the node - if the VM is
> > able to adapt itself to such a situation, f.e. by starting several MPI
> > ranks and using shared memory for MPI communication. Further, to
> > cleanly stop the job, the queueing system will have to stop the VMs,
> > sending first a "shutdown" and then a "destroy" command, similar to
> > sending SIGTERM and SIGKILL today.
>
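The "shutdown, then destroy" sequence in the quoted text follows the same escalation pattern as SIGTERM followed by SIGKILL. As a rough sketch of that pattern for an ordinary process (not a VM), assuming a POSIX system and a hypothetical helper name `stop_gracefully` with an arbitrary default grace period:

```python
import subprocess


def stop_gracefully(proc, grace_seconds=5.0):
    """Ask a child process to exit, then force it if it does not comply.

    `proc` is a subprocess.Popen object. The helper name and the default
    grace period are illustrative assumptions, not taken from any
    particular queueing system.
    """
    proc.terminate()  # polite request: SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # forced stop: SIGKILL on POSIX
        proc.wait()  # reap the child so no zombie is left behind
        return "killed"
```

A queueing system stopping VMs would presumably do the same thing with the hypervisor's "shutdown" and "destroy" operations in place of the two signals, with the grace period sized to how long a guest needs to shut down cleanly.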

I will provide a counter-example here. I think a lot of people have
thought about rebooting nodes every time a job finishes. There are codes
out there which leave processes running, or leave shared memory segments
behind, if the code is not properly terminated. I think everyone has had to
run clean-ipcs at some time!
Yes, you're right, the codes should be written properly and should not do
this.
However, it is very tempting to put in a reboot as a step following every
job, so that you get a machine in a known state for the next job.
Running virtual machines will make that easy (depending on how long they
take to boot up).
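Short of a full reboot, a job epilogue could at least look for the leftover shared memory segments mentioned above. The following is only a sketch: it assumes the column layout printed by `ipcs -m` from util-linux on Linux (key, shmid, owner, perms, bytes, nattch, status), and the function name and sample data are made up for illustration.

```python
def orphaned_shm_ids(ipcs_output, owner):
    """Return shmids owned by `owner` that have no attached processes.

    `ipcs_output` is the text printed by `ipcs -m`; the column order
    (key, shmid, owner, perms, bytes, nattch) is an assumption about
    the util-linux format.
    """
    ids = []
    for line in ipcs_output.splitlines():
        fields = line.split()
        # Data rows start with a hex key; headers and blank lines do not.
        if len(fields) < 6 or not fields[0].startswith("0x"):
            continue
        _key, shmid, seg_owner, _perms, _size, nattch = fields[:6]
        if seg_owner == owner and nattch == "0":
            ids.append(int(shmid))
    return ids


# Made-up sample in the assumed `ipcs -m` layout.
SAMPLE = """
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 32768      alice      600        1048576    0
0x0000162e 32769      alice      666        4096       2
0x00000000 32770      bob        600        524288     0
"""

print(orphaned_shm_ids(SAMPLE, "alice"))
```

The returned ids could then be handed to something like `ipcrm -m <shmid>`. This only covers System V shared memory; stray processes and other state would still need their own cleanup, which is part of why a reboot or a fresh VM is tempting.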

I agree with you about the 5% figure - the point I was making is that there
will come a point where the advantages of running a virtual machine will
outweigh a few percent of performance loss. Who knows where that point will
be!

