The good news...<br><br>We (IBM) demonstrated such a system at SC08 as a Cloud Computing demo.  The setup was a combination of Moab, xCAT, and Xen.<br><br>xCAT is an open source provisioning system that can control and monitor hardware, discover nodes, and provision stateful or stateless physical nodes and virtual machines.  xCAT supports both KVM and Xen, including live migration.<br>

<br>Moab was used as the scheduler, with xCAT as one of Moab's resource managers.  Moab uses xCAT's SSL/XML interface to query node state and to tell xCAT what to do.<br><br>Some of the things you can do:<br><br>1.  Green computing.  Provision nodes on demand with any OS (Windows too).  E.g. Torque command line:  qsub -l nodes=10:ppn=8,walltime=10:00:00,os=rhimagea.  Idle rhimagea nodes will be reused; other idle or powered-off nodes will be provisioned with rhimagea.  When Torque checks in on those nodes, the job starts.  For this to be efficient, all node images, including hypervisor images, should be stateless.  For Windows we use preinstalled iSCSI images (xCAT uses gPXE to simulate iSCSI HW on any x86_64 node).  When nodes are idle for more than 10 minutes, Moab instructs xCAT to power them off (unless something in the queue will use them soon).  Since the images are stateless, there is no need for cleanup.  I have this running on a 3780-node diskless system today.<br>
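To make the node-selection logic in #1 concrete, here is a minimal Python sketch.  It is not Moab or xCAT code: the Node record, plan_job, and nodes_to_power_off are hypothetical names made up for illustration, and the only facts taken from the description above are the reuse-then-provision order and the 10-minute idle limit.<br>

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical node record; not an xCAT or Moab data structure."""
    name: str
    state: str             # "idle", "busy", or "off"
    image: Optional[str]   # OS image currently booted, None if powered off
    idle_minutes: int = 0


def plan_job(nodes, count, image):
    """Reuse idle nodes already running `image`, then fill the remainder
    with other idle or powered-off nodes that must be provisioned.
    Returns (reuse, provision), or None if there are not enough free nodes."""
    reuse = [n for n in nodes if n.state == "idle" and n.image == image][:count]
    chosen = {n.name for n in reuse}
    needed = count - len(reuse)
    candidates = [n for n in nodes
                  if n.state in ("idle", "off") and n.name not in chosen]
    provision = candidates[:needed]
    if len(reuse) + len(provision) < count:
        return None  # job stays queued
    return reuse, provision


def nodes_to_power_off(nodes, idle_limit=10, wanted_soon=frozenset()):
    """Names of nodes idle longer than `idle_limit` minutes that no queued
    job is expected to need soon (`wanted_soon` is an assumed input)."""
    return [n.name for n in nodes
            if n.state == "idle"
            and n.idle_minutes > idle_limit
            and n.name not in wanted_soon]
```

With one idle rhimagea node, one idle node on another image, and one powered-off node, a 3-node request reuses the first and provisions the other two.<br>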

<br>2.  Route around problems.  If a dynamic provision fails, Moab will try another node.  Moab can also query xCAT about the HW health of a machine and opt to avoid nodes that have an "amber" light.  Excessive ECC errors, over-temperature conditions, etc. are events that our service processors log.  If a threshold is reached, the node is marked "risky" or "doomed to fail".  Moab policies can be set up to determine how to handle nodes in this state, e.g. local MPI jobs: no risky nodes; grid jobs from another university: OK to use risky nodes.  Or set up a reservation and email someone to fix the node.<br>
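The policy idea in #2 can be sketched in a few lines of Python.  This is an illustration only, not Moab policy syntax: the threshold values, the job class names, and the functions classify and eligible_nodes are all assumptions.<br>

```python
# Severity order of the health states mentioned above.
RISK_ORDER = {"ok": 0, "risky": 1, "doomed": 2}

# Hypothetical policy table: the worst health state each job class accepts.
POLICY = {"local_mpi": "ok", "grid": "risky"}


def classify(event_count, risky_at=10, doomed_at=50):
    """Map a count of logged HW events (ECC errors, over-temp, ...) to a
    health state. The thresholds are placeholders, not real defaults."""
    if event_count >= doomed_at:
        return "doomed"
    if event_count >= risky_at:
        return "risky"
    return "ok"


def eligible_nodes(health, job_class):
    """Return the sorted node names whose health state is acceptable
    for the given job class under POLICY."""
    limit = RISK_ORDER[POLICY[job_class]]
    return sorted(name for name, state in health.items()
                  if RISK_ORDER[state] <= limit)
```

Nodes that no class accepts (the "doomed" ones here) are the natural candidates for a reservation plus an email to whoever fixes hardware.<br>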

<br>3.  Virtual machine balancing.  Since xCAT can live migrate Xen and KVM guests (and soon ESX4), and since it provides a programmable interface, Moab has no problem moving VMs around based on policies.  Combine this with the two examples above and you can move VMs off a node when a HW warning is issued.  You can enable green policies to consolidate VMs and power off the emptied nodes.  You can query xCAT for node temperature and do thermal balancing.<br>
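As a rough sketch of the "move VMs if a HW warning is issued" case in #3, the Python below plans which VM goes where.  It only produces a migration plan; the actual live migration would be requested through xCAT, which is not shown.  The function name, the placement dictionary, and the simple per-host VM limit are assumptions made for illustration.<br>

```python
def plan_evacuation(placement, capacity, flagged):
    """Plan live migrations off hosts that have a hardware warning.

    placement: dict mapping host name -> list of VM names on that host
    capacity:  maximum number of VMs per host (a simplifying assumption)
    flagged:   set of host names with a HW warning
    Returns a list of (vm, source_host, destination_host) tuples.
    """
    # Current VM count on every healthy host.
    load = {h: len(vms) for h, vms in placement.items() if h not in flagged}
    moves = []
    for src in sorted(flagged):
        for vm in placement.get(src, []):
            targets = [h for h in load if load[h] < capacity]
            if not targets:
                raise RuntimeError(f"no healthy host has room for {vm}")
            # Least-loaded healthy host first; host name breaks ties.
            dst = min(targets, key=lambda h: (load[h], h))
            load[dst] += 1
            moves.append((vm, src, dst))
    return moves
```

Picking the least-loaded target spreads the evacuated VMs out; a green policy would do the opposite and pack them onto as few hosts as possible.<br>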

<br>The above are just a few ideas that we are pursuing with our customers today.<br><br>The bad news...<br><br>I have no idea about the state of VMs on IB.  That can be an issue with MPI.  Believe it or not, most HPC sites do not use MPI.  They are all batch systems where storage I/O is the bottleneck.  However, I have tested MPI over IP with VMs and moved things around.  No problem.  Hint:  You will need a large L2 network, since the VMs retain their MAC and IP addresses when they migrate.  Yes, there are workarounds, but nothing as easy as a large L2.<br>

<br>Application performance may suffer in a VM.  Benchmark first.  If you just use #1 and #2 above on the iron, you can decrease your risk of failure and run faster.  And we all checkpoint, right?  :-)<br><br>Lastly, check out <a href="http://lxc.sourceforge.net/">http://lxc.sourceforge.net/</a>.  This is lightweight virtualization.  It's not a new concept, but hopefully by next year automated checkpoint/restart with MPI jobs over IB may be supported.  This may be a better fit for HPC than full-on virtualization.<br>

<br><br><div class="gmail_quote">On Mon, Jun 15, 2009 at 11:59 AM, John Hearns <span dir="ltr">&lt;<a href="mailto:hearnsj@googlemail.com">hearnsj@googlemail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

I was doing a search on ganglia + ipmi (I'm looking at doing such a<br>

thing for temperature measurement) when I came across this paper:<br>

<br>

<a href="http://www.csm.ornl.gov/%7Eengelman/publications/nagarajan07proactive.ppt.pdf" target="_blank">http://www.csm.ornl.gov/~engelman/publications/nagarajan07proactive.ppt.pdf</a><br>

<br>

Proactive Fault Tolerance for HPC using Xen virtualization<br>

<br>

It's something I've wanted to see working - doing a Xen live migration<br>

of a 'dodgy' compute node, and the job just keeps on trucking.<br>

Looks as if these guys have it working. Anyone else seen similar?<br>

<font color="#888888"><br>

John Hearns<br>

_______________________________________________<br>

Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org">Beowulf@beowulf.org</a> sponsored by Penguin Computing<br>

To change your subscription (digest mode or unsubscribe) visit <a href="http://www.beowulf.org/mailman/listinfo/beowulf" target="_blank">http://www.beowulf.org/mailman/listinfo/beowulf</a><br>

</font></blockquote></div><br>