[Beowulf] HPC fault tolerance using virtualization

Tue Jun 16 03:27:18 PDT 2009

On Tue, 16 Jun 2009, John Hearns wrote:

> I believe that if we can get features like live migration of failing 
> machines, plus specialized stripped-down virtual machines specific 
> to job types then we will see virtualization becoming mainstream in 
> HPC clustering.

You might be right, at least in the short term. In my experience with 
several ISVs, they are very slow to adopt newer features related to 
system infrastructure in their software - by system infrastructure I 
mean anything that has to do with the OS (e.g. taking advantage of 
CPU/memory affinity), the MPI library, the queueing system, etc. So 
even if the MPI library gains features that allow fault tolerance, it 
will take a long time until they are in real-world use.

By comparison, virtualization is something that the ISVs can 
completely offload to the sysadmins or system integrators, because 
neither the application nor the MPI library (which is sometimes linked 
into the executable...) has to be aware of it. The ISVs can then even 
choose which virtualization solution they "support".

Another aspect, which I mentioned some time ago, is that the ISV can 
much more easily enforce the use of a particular OS and environment, 
because this runs in the VM and is independent of what runs on the 
host. They can even provide a VM image which includes the OS, 
environment and application and declare this as the only supported 
configuration... This is already done for non-parallel applications, 
and only one more step is needed for parallel ones: adapting the image 
to the underlying network to get HPC-level performance. I think that 
adapting to the queueing system is not really necessary from inside 
the VM; the queueing system can either start one VM per core or start 
one VM with several virtual CPUs to fill the number of processing 
elements (or slots) allocated for the job on the node - provided the 
VM is able to adapt itself to such a situation, e.g. by starting 
several MPI ranks and using shared memory for MPI communication. 
Further, to stop the job cleanly, the queueing system will have to 
stop the VMs, first sending a "shutdown" and then a "destroy" command, 
much as it sends SIGTERM and then SIGKILL today.
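
As a rough illustration of the shutdown-then-destroy sequence, here is 
a minimal Python sketch. It assumes the queueing system has three hooks 
into whatever VM manager is in use; the names stop_vm, request_shutdown, 
force_destroy and is_running, as well as the 30 second grace period, 
are hypothetical and not taken from any real queueing system or 
hypervisor API.

```python
import time
from typing import Callable


def stop_vm(
    request_shutdown: Callable[[], None],  # hypothetical hook: polite stop
    force_destroy: Callable[[], None],     # hypothetical hook: hard stop
    is_running: Callable[[], bool],        # hypothetical hook: state query
    grace_seconds: float = 30.0,           # assumed grace period
    poll_interval: float = 0.5,
) -> str:
    """Stop a VM politely first, forcibly if it does not comply.

    Returns "shutdown" if the guest stopped within the grace period,
    "destroy" if it had to be killed.
    """
    # Analogous to SIGTERM: the guest gets a chance to clean up.
    request_shutdown()

    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        if not is_running():
            return "shutdown"
        time.sleep(poll_interval)

    # One last check, in case the guest stopped during the final sleep.
    if not is_running():
        return "shutdown"

    # Analogous to SIGKILL: the guest gets no chance to clean up.
    force_destroy()
    return "destroy"
```

With libvirt, for instance, the two hooks would presumably map onto its 
domain shutdown and destroy operations, but that wiring is left out 
here. The queueing system would call stop_vm once per VM belonging to 
the job.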

-- 
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.costescu at iwr.uni-heidelberg.de