[Beowulf] HPC fault tolerance using virtualization
Kilian CAVALOTTI kilian.cavalotti.work at gmail.com
Tue Jun 16 01:38:55 PDT 2009
- Previous message: [Beowulf] HPC fault tolerance using virtualization
- Next message: [Beowulf] HPC fault tolerance using virtualization
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Monday 15 June 2009 20:47:40 Michael Di Domenico wrote:
> It would be nice to be able to just move bad hardware out from under a
> running job without affecting the run of the job.

I may be missing something major here, but if there's bad hardware, chances are the job has already failed because of it, right? Whether it's a bad disk (which the OS would only notice while trying to write to it, most likely at the job's request), bad memory, a bad CPU, or a faulty PSU, hardware that loses bits mostly manifests itself as software errors. There is very little chance of spotting a bad DIMM until something (like a job) tries to write to it.

So unless there's a way to detect faulty hardware before it affects any software, it's very likely that the job will have crashed already, before the OS could pull out its migration toolkit.

The paper John mentioned is centered around IPMI for preventive fault detection. It probably works in some cases (where you can use thresholds, as with temperature probes or fan speeds), where IPMI detects hardware errors before they affect the running job. But from what I've seen most often, it's kind of too late, and IPMI logs an error when the job has already crashed.

And even if it hasn't crashed yet, what kind of assurance do you have that the result of the simulation has not been corrupted in some way by that faulty DIMM?

My take on this is that it's probably more efficient to develop checkpointing and recovery features in software (like MPI) rather than adding a virtualization layer, which is likely to decrease performance.

Cheers,
-- 
Kilian
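
To make the software checkpointing idea above concrete, here is a minimal single-process sketch in Python. It is only an illustration under stated assumptions, not something from the original post or from any MPI library: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the `fail_at` parameter used to simulate a node failure, and the toy accumulation loop are all hypothetical. Note that it only covers crash recovery; it does nothing to detect silent corruption, so a checkpoint written after a faulty DIMM has flipped bits would faithfully preserve the bad state.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach the disk before renaming
    os.replace(tmp_path, path)  # a crash never leaves a half-written checkpoint


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, checkpoint_every=100, fail_at=None):
    """Toy 'simulation': sum the step indices, resuming from a checkpoint if present.

    fail_at is a hypothetical knob that simulates a node failure at a given step.
    """
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

If a first call to `run` dies partway through, a second call with the same `path` picks up from the last saved step instead of step 0, so at most `checkpoint_every` steps of work are lost. A real MPI job would additionally have to coordinate checkpoints across ranks, which this sketch does not attempt.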
