[Beowulf] Re: HPC fault tolerance using virtualization
Dave Love d.love at liverpool.ac.uk
Tue Jun 16 03:01:42 PDT 2009
- Previous message: [Beowulf] HPC fault tolerance using virtualization
- Next message: [Beowulf] HPC fault tolerance using virtualization
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
John Hearns <hearnsj at googlemail.com> writes:

> I was doing a search on ganglia + ipmi (I'm looking at doing such a
> thing for temperature measurement)

Like
<URL:http://www.nw-grid.ac.uk/LivScripts?action=AttachFile&do=get&target=freeipmi-gmetric-temp>?
If you want to take action, though, go direct to Nagios or similar
with sensor readings, chassis health data, etc.

> Its something I've wanted to see working - doing a Xen live migration
> of a 'dodgy' compute node, and the job just keeps on trucking.
> Looks as if these guys have it working. Anyone else seen similar?

I don't understand what's wrong with using MPI fault tolerance. I
recall testing LAM+BLCR and having processes migrate when SGE host
queues were suspended, but I'm not in a position to try the Open-MPI
version. Nothing short of checkpoints will help, anyway, when the node
just dies, and that's the problem we see most often (e.g. because we
were sold a shambolic Barcelona system with flaky hardware and an OS
that doesn't support quad core properly).

How does Xen perform generally, anyhow? Are there useful data on the
HPC performance impact of Xen and/or KVM for, say, Ethernet NUMA
systems? I've only seen it for non-NUMA Infiniband systems.
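For readers wondering what the ganglia + ipmi temperature idea might look like in practice, here is a minimal Python sketch. It is not the script at the URL above, whose contents are not shown here. It assumes sensor data arrives as comma-separated text with the header `ID,Name,Type,Reading,Units,Event` (roughly what FreeIPMI's `ipmi-sensors --comma-separated-output` prints in recent versions; older releases differ), and it builds a `gmetric` command line for each temperature reading. The function names and the `ipmi_temp_` metric prefix are hypothetical, chosen only for illustration.

```python
import csv
import io
import subprocess


def parse_temperatures(sensor_csv):
    """Return {sensor name: degrees C} for temperature rows with a numeric reading.

    Assumes a header row of ID,Name,Type,Reading,Units,Event (an assumption
    about the ipmi-sensors output format, not something stated in the post).
    """
    temps = {}
    for row in csv.DictReader(io.StringIO(sensor_csv)):
        if row.get("Type") != "Temperature":
            continue
        try:
            temps[row["Name"]] = float(row["Reading"])
        except (KeyError, TypeError, ValueError):
            continue  # reading missing or non-numeric, e.g. "N/A"
    return temps


def gmetric_command(name, value):
    """Build the gmetric argument list for one temperature reading."""
    # Hypothetical naming scheme: "CPU1 Temp" -> "ipmi_temp_cpu1_temp"
    metric = "ipmi_temp_" + name.strip().lower().replace(" ", "_")
    return [
        "gmetric",
        "--name=" + metric,
        "--value=" + format(value, ".1f"),
        "--type=float",
        "--units=Celsius",
    ]


def publish(sensor_csv):
    """Send every parsed temperature to Ganglia (needs gmetric on PATH)."""
    for name, value in parse_temperatures(sensor_csv).items():
        subprocess.run(gmetric_command(name, value), check=True)


SAMPLE = """ID,Name,Type,Reading,Units,Event
4,CPU1 Temp,Temperature,41.00,C,'OK'
5,Fan1,Fan,5400.00,RPM,'OK'
6,CPU2 Temp,Temperature,N/A,C,N/A
"""
```

On a real node one would feed `publish()` the captured output of `ipmi-sensors` from cron or similar. As the message says, this only gets readings into Ganglia graphs; acting on them is better left to Nagios or a comparable monitoring system.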
