<html><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><br><div><blockquote type="cite"><div><font class="Apple-style-span" color="#000000"><br></font><br>Date: Tue, 16 Jun 2009 10:38:55 +0200<br>From: Kilian CAVALOTTI &lt;<a href="mailto:kilian.cavalotti.work@gmail.com">kilian.cavalotti.work@gmail.com</a>&gt;<br><br>On Monday 15 June 2009 20:47:40 Michael Di Domenico wrote:<br><blockquote type="cite">It would be nice to be able to just move bad hardware out from under a<br></blockquote><blockquote type="cite">running job without affecting the run of the job.  <br></blockquote><br>I may be missing something major here, but if there's bad hardware, chances <br>are the job has already failed from it, right? Would it be a bad disk (and the <br>OS would only notice a bad disk while trying to write on it, likely asked to <br>do so by the job), or bad memory, or bad CPU, or faulty PSU. Anything hardware <br>losing bits mainly manifests itself in software errors. There is very little <br>chance to spot a bad DIMM until something (like a job) tries to write to it.</div></blockquote><div><br></div><div>We have recently purchased "un-blade" systems that may belong on the list of things you are missing.  These are systems where multiple nodes are hard-wired into a single chassis, so to work on one node, all of them have to come offline.  The power efficiency and system costs are compelling, but the complexity of maintenance is a trade-off we decided to try.  
If the Virtualization tax were low enough, it would be useful and would give us more incentive to use these more cost- and power-efficient options without creating huge maintenance hassles.</div><div><br></div><blockquote type="cite"><div><br><br>So unless there's a way to detect faulty hardware before it affects anything <br>software, it's very likely that the job would have crashed already, before the <br>OS could pull out its migration toolkit.</div></blockquote><div>If the job is running against a large networked file system, but the local *Real* OS depends on the failing disk, the job could be migrated off when the OS starts detecting SCSI or network (IB?) errors.  The same is true for some network issues.  Of course, who in their right mind would want an OS dependent on a local disk these days?  :)  </div><div><br></div><div>Note: this is a shameless plug for Perceus and all other such options that leave spinning disk for scratch/checkpointing or some other lower-risk purpose... if any.</div><div><br></div><div><br></div><blockquote type="cite"><div><br><br>The paper John mentioned is centered around IPMI for preventive fault <br>detection. It probably works for some cases (where you can use thresholds, <br>like temperature probes or fan speeds), where IPMI detects hardware errors <br>before it affects the running job. But from what I've seen most often, it's <br>kind of too late, and IPMI logs an error when the job has crashed already. And <br>even if it didn't crash yet, what kind of assurance do you have that the <br>result of simulation has not been corrupted in some way by that faulty DIMM <br>you got?</div></blockquote><div>Single-bit errors likely won't corrupt the system, but it would be nice to handle them when they pop up, rather than waiting for maintenance windows or offlining the node and waiting for any jobs to drain off of it.  
This would be a win for admins, who could do maintenance on their own schedules and minimize the machine's actual lost compute time.</div><br><blockquote type="cite"><div><br><br>My take on this is that it's probably more efficient to develop checkpointing <br>features and recovery in software (like MPI) rather than adding a <br>virtualization layer, which is likely to decrease performance. </div></blockquote><div>I agree.  I was very excited about Evergrid's (now Librato?) notion of universal checkpointing... but I've never been able to get any time from/with them.  This seems like a checkpointing approach that would work out very cleanly for the many apps that have no notion of checkpointing.</div><div><br></div><div>Moral of the story:</div><div>There was a day when the OS was a huge consumer of a workstation's resources (CPU/Memory/Disk) and as such a huge Tax.  Today it's a small fraction of the footprint, so we worry less and less about it and its efficiency except where it impacts the performance/stability of the apps that depend on it.  My guess is that Virtualization is just an extension of that trend and will eventually be the way we need to go as the Tax of the OS / VM layers shrinks.  </div><div><br></div><div>Given that general trend, I am happy to see smart people who prefer graceful code and efficiency trying to steer these VM options toward low-overhead solutions where I can firewall a bad code off from leaving machines in a bad state for the next code that tries to run there.  </div><div><br></div><div>Should the applications be better stewards of the environment they run in... yes.  </div><div>Should the OS protect itself better from bad codes... yes.  </div><div>Should the admin configure the OS better so that codes can't do bad things... yes.  </div><div><br></div><div>But I can't control those.  I can control my OS and give the apps their own OS via VMs if the Tax is low enough.  
Anything different would be like saying we shouldn't need firewalls because the apps listening to the ports shouldn't be hackable.  It's true, but not something I want to try and control.</div><br><blockquote type="cite"><div><br><br>Cheers,<br>-- <br>Kilian<br><br></div></blockquote><div><br></div><div>Cheers!</div><div>Greg</div><div><br></div></div><br></body></html>