[Beowulf] HPC fault tolerance using virtualization

Kilian CAVALOTTI kilian.cavalotti.work at gmail.com
Tue Jun 16 01:38:55 PDT 2009

On Monday 15 June 2009 20:47:40 Michael Di Domenico wrote:
> It would be nice to be able to just move bad hardware out from under a
> running job without affecting the run of the job.  

I may be missing something major here, but if there's bad hardware, chances 
are the job has already failed because of it, right? Whether it's a bad disk (which 
the OS typically only notices while trying to write to it, likely on behalf of 
the job), bad memory, a bad CPU, or a faulty PSU, hardware losing bits mainly 
manifests itself as software errors. There is very little chance of spotting a 
bad DIMM until something (like a job) tries to write to it.

So unless there's a way to detect faulty hardware before it affects any 
software, it's very likely that the job will have crashed already, before the 
OS could pull out its migration toolkit.

The paper John mentioned is centered around IPMI for preventive fault 
detection. It probably works in some cases (where you can set thresholds, 
like temperature probes or fan speeds), so that IPMI flags hardware trouble 
before it affects the running job. But from what I've seen most often, it's 
kind of too late, and IPMI logs an error when the job has crashed already. And 
even if it hasn't crashed yet, what kind of assurance do you have that the 
result of the simulation has not been corrupted in some way by that faulty DIMM 
you got?
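To illustrate the threshold idea, here is a minimal sketch of preventive 
monitoring on top of IPMI sensor readings. The pipe-delimited sample lines 
imitate `ipmitool sensor` output, but the exact column order varies by BMC, 
so treat the field layout (and the 90% margin) as assumptions for 
illustration only:

```python
# Sketch: threshold-based preventive checks on IPMI sensor readings.
# The field layout below (name | reading | units | status | upper critical)
# is illustrative; real `ipmitool sensor` output has more columns.

def parse_sensors(text):
    """Parse pipe-delimited sensor rows into (name, reading, units, crit)."""
    sensors = []
    for line in text.strip().splitlines():
        fields = [f.strip() for f in line.split("|")]
        name, reading, units, _status, upper_crit = fields[:5]
        try:
            sensors.append((name, float(reading), units, float(upper_crit)))
        except ValueError:
            continue  # skip non-numeric readings such as 'na'
    return sensors

def at_risk(sensors, margin=0.9):
    """Flag sensors whose reading exceeds `margin` of the critical limit."""
    return [name for name, val, _units, crit in sensors if val >= margin * crit]

sample = """\
CPU1 Temp | 82.000   | degrees C | ok | 90.000
Fan2      | 1200.000 | RPM       | ok | 8000.000
Ambient   | na       | degrees C | ns | 45.000
"""

print(at_risk(parse_sensors(sample)))  # ['CPU1 Temp']
```

A real deployment would poll the BMC periodically and drain the node from the 
batch scheduler when a sensor approaches its limit, rather than waiting for 
the SEL to record a hard error.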

My take on this is that it's probably more efficient to develop checkpointing 
and recovery features in software (like MPI) rather than adding a 
virtualization layer, which is likely to decrease performance. 
