[Beowulf] HPC fault tolerance using virtualization

Tue Jun 16 09:14:04 PDT 2009

>
>
> Date: Tue, 16 Jun 2009 10:38:55 +0200
> From: Kilian CAVALOTTI <kilian.cavalotti.work at gmail.com>
>
> On Monday 15 June 2009 20:47:40 Michael Di Domenico wrote:
>> It would be nice to be able to just move bad hardware out from under a
>> running job without affecting the run of the job.
>
> I may be missing something major here, but if there's bad hardware, chances
> are the job has already failed from it, right? Would it be a bad disk (and
> the OS would only notice a bad disk while trying to write on it, likely
> asked to do so by the job), or bad memory, or bad CPU, or faulty PSU.
> Anything hardware losing bits mainly manifests itself in software errors.
> There is very little chance to spot a bad DIMM until something (like a job)
> tries to write to it.

We have recently purchased "un-blade" systems that may be an example of
what you are missing. These are systems where multiple nodes are hard-wired
into a single chassis, so to work on one node, all of them have to come
offline. The power efficiency and system costs are compelling, but the
complexity of maintenance is a trade-off we decided to try. If the
virtualization tax were low enough, it would be useful here and would make
us more inclined to use these more cost- and power-efficient options
without creating huge maintenance hassles.

>
>
> So unless there's a way to detect faulty hardware before it affects anything
> software, it's very likely that the job would have crashed already, before
> the OS could pull out its migration toolkit.
If the job is running against a large networked file system, but the
local *real* OS depends on the failing disk, the job could be migrated
off when the OS starts detecting SCSI or network (IB?) errors. The same
is true for some network issues. Of course, who in their right mind would
want an OS dependent on a local disk these days?  :)

Note: this is a shameless plug for Perceus and all other such options
that leave spinning disk for scratch/checkpointing or some other
lower-risk purpose... if any.
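
To make the "migrate when the OS starts detecting errors" idea a bit more
concrete, here is a rough Python sketch of the trigger logic only. The
error patterns, the threshold, and the function names are all my own
assumptions for illustration; real kernel messages vary by driver and
version, and the actual migration step is not shown.

```python
import re

# Illustrative patterns only (assumption): real messages differ by driver/kernel.
ERROR_PATTERNS = [
    re.compile(r"I/O error", re.IGNORECASE),
    re.compile(r"medium error", re.IGNORECASE),
]


def count_storage_errors(log_lines):
    """Count log lines that match any of the illustrative error patterns."""
    return sum(
        1 for line in log_lines
        if any(p.search(line) for p in ERROR_PATTERNS)
    )


def should_migrate(log_lines, threshold=3):
    """Return True once the number of matching lines reaches the threshold."""
    return count_storage_errors(log_lines) >= threshold
```

The point of the threshold is to avoid moving a job on a single transient
error; the right value is something each site would have to tune.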

>
>
> The paper John mentioned is centered around IPMI for preventive fault
> detection. It probably works for some cases (where you can use thresholds,
> like temperature probes or fan speeds), where IPMI detects hardware errors
> before it affects the running job. But from what I've seen most often, it's
> kind of too late, and IPMI logs an error when the job has crashed already.
> And even if it didn't crash yet, what kind of assurance do you have that the
> result of simulation has not been corrupted in some way by that faulty DIMM
> you got?
Single-bit errors likely won't corrupt the system, but it would be
nice to handle them when they pop up, rather than waiting for
maintenance windows or offlining the node and waiting for any jobs to
drain off of it. It would be a win for admins to be able to do
maintenance on their own schedule and minimize the compute time the
machine actually loses.
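
As a sketch of how one might notice single-bit errors as they pop up: on
Linux nodes with the EDAC drivers loaded, correctable-error counters are
exposed under /sys/devices/system/edac/mc/mc*/ce_count. The Python below
polls those counters and flags controllers whose count has grown. The
function names and the max_new_errors threshold are assumptions of mine,
not anything we run in production.

```python
from pathlib import Path

EDAC_ROOT = Path("/sys/devices/system/edac/mc")  # Linux EDAC sysfs location


def read_ce_counts(root=EDAC_ROOT):
    """Return {controller_name: correctable_error_count} for each mc* directory."""
    counts = {}
    for ce_file in sorted(Path(root).glob("mc*/ce_count")):
        counts[ce_file.parent.name] = int(ce_file.read_text().strip())
    return counts


def needs_attention(previous, current, max_new_errors=10):
    """List controllers whose count grew by more than max_new_errors (assumed limit)."""
    return [
        name for name, count in current.items()
        if count - previous.get(name, 0) > max_new_errors
    ]
```

A node that shows up in needs_attention() would be a candidate for moving
its job elsewhere and swapping the DIMM, instead of waiting for the next
maintenance window.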

>
>
> My take on this is that it's probably more efficient to develop
> checkpointing features and recovery in software (like MPI) rather than
> adding a virtualization layer, which is likely to decrease performance.
I agree. I was very excited about Evergrid's (now Librato?) notion
of universal checkpointing... but I've never been able to get any time
from/with them. It seems like an approach that would work out very
cleanly for the many apps that have no notion of checkpointing.
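
For contrast, this is roughly what an app has to do when it handles
checkpointing itself. It is a toy Python sketch of application-level
checkpoint/restart, not the transparent kind described above; the state
layout, function names, and checkpoint interval are all made up for
illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write state to a temp file, then atomically rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic on POSIX within one filesystem


def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def run(total_steps, path, checkpoint_every=100):
    """Toy iterative job that resumes from the last checkpoint if one exists."""
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    return state
```

The write-then-rename step matters: a node that dies mid-write leaves the
previous checkpoint intact rather than a half-written file.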

Moral of the story:
There was a day when the OS was a huge consumer of a workstation's
resources (CPU/memory/disk) and as such a huge tax. Today it's a
small fraction of the footprint, so we worry less and less about
it and its efficiency, except where it impacts the performance or
stability of the apps that depend on it. My guess is that
virtualization is just an extension of that trend and will eventually
be the way we need to go as the tax of the OS / VM layers shrinks.

Given that general trend, I am happy to see smart people who prefer
graceful code and efficiency trying to steer these VM options toward
low-overhead solutions, where I can firewall off a bad code so that it
can't leave machines in a bad state for the next code that tries to run
there.

Should the applications be better stewards of the environment they run  
in... yes.
Should the OS protect itself better from bad codes... yes.
Should the admin configure the OS better so that codes can't do bad  
things... yes.

But I can't control those. What I can control is my OS, and I can give
the apps their own OS via VMs if the tax is low enough. Anything
different would be like saying we shouldn't need firewalls because the
apps listening on the ports shouldn't be hackable. It's true, but not
something I want to try and control.

>
>
> Cheers,
> -- 
> Kilian
>

Cheers!
Greg
