[Beowulf] HPC fault tolerance using virtualization

John Hearns hearnsj at googlemail.com
Mon Jun 15 10:59:37 PDT 2009


I was doing a search on ganglia + ipmi (I'm looking at doing such a
thing for temperature measurement) when I came across this paper:

http://www.csm.ornl.gov/~engelman/publications/nagarajan07proactive.ppt.pdf

Proactive Fault Tolerance for HPC using Xen virtualization

It's something I've wanted to see working - doing a Xen live migration
of a 'dodgy' compute node, and the job just keeps on trucking.
Looks as if these guys have it working. Anyone else seen similar?

John Hearns


