[Beowulf] HPC fault tolerance using virtualization
kilian.cavalotti.work at gmail.com
Tue Jun 16 01:38:55 PDT 2009
On Monday 15 June 2009 20:47:40 Michael Di Domenico wrote:
> It would be nice to be able to just move bad hardware out from under a
> running job without affecting the run of the job.
I may be missing something major here, but if there's bad hardware, chances
are the job has already failed from it, right? Whether it's a bad disk (which
the OS would only notice while trying to write to it, likely because the job
asked it to), bad memory, a bad CPU, or a faulty PSU, hardware that loses bits
mainly manifests itself as software errors. There is very little
chance to spot a bad DIMM until something (like a job) tries to write to it.
So unless there's a way to detect faulty hardware before it affects any
software, it's very likely that the job would have crashed already, before the
OS could pull out its migration toolkit.
The paper John mentioned is centered around IPMI for preventive fault
detection. It probably works for some cases (where you can use thresholds,
like temperature probes or fan speeds), where IPMI detects hardware errors
before they affect the running job. But from what I've seen most often, it's
kind of too late, and IPMI logs an error when the job has crashed already. And
even if it hasn't crashed yet, what kind of assurance do you have that the
result of the simulation has not been corrupted in some way by that faulty DIMM?
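To make the threshold case concrete, here is a minimal sketch of the kind of check that can fire before a job is affected. It assumes the sensor readings have already been collected somewhere (from IPMI or anything else) and handed over as a plain dict. The sensor names, the limits, and the function name are all made up for illustration. Note that it says nothing about a DIMM that silently flips bits, which is exactly the case that worries me.

```python
def sensors_out_of_range(readings, limits):
    """Return the sorted names of sensors whose value is outside its limits.

    readings: {sensor_name: value}, assumed to be collected elsewhere
    limits:   {sensor_name: (low, high)}, where None means "no bound"
    Sensors without an entry in limits are never flagged.
    """
    flagged = []
    for name, value in readings.items():
        low, high = limits.get(name, (None, None))
        too_low = low is not None and value < low
        too_high = high is not None and value > high
        if too_low or too_high:
            flagged.append(name)
    return sorted(flagged)


# Hypothetical readings and limits, for illustration only.
readings = {"cpu_temp_c": 91.0, "fan1_rpm": 800, "psu_temp_c": 40.0}
limits = {"cpu_temp_c": (None, 85.0), "fan1_rpm": (1500, None)}
flagged = sensors_out_of_range(readings, limits)
```

A node with a non-empty `flagged` list could be drained before the next job lands on it, but that only helps for faults that show up as a slow drift in a probe.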
My take on this is that it's probably more efficient to develop checkpointing
features and recovery in software (like MPI) rather than adding a
virtualization layer, which is likely to decrease performance.
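As a rough illustration of what application-level checkpointing looks like, here is a minimal single-process sketch. It is not how any particular MPI implementation does it. The state layout, the function names, and the `fail_at` switch used to simulate a node failure are assumptions made up for this example. The checkpoint is written to a temporary file and renamed into place, so a crash in the middle of a write cannot leave a half-written checkpoint behind.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, every=10, fail_at=None):
    """Toy 'simulation': sum the step indices, checkpointing every N steps.

    fail_at is a hypothetical hook that raises at a given step to mimic
    a hardware failure killing the job.
    """
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

If the job dies at step 25, a restart picks up from the checkpoint taken at step 20 and only those five steps are redone. This does not address results that were silently corrupted before the checkpoint was written.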