[Beowulf] Jeff Squyres MPI proposals

Christopher Samuel samuel at unimelb.edu.au
Thu Mar 3 15:30:23 PST 2016


On 04/03/16 06:40, Douglas Eadline wrote:

> Yes, failure needs to be an option.

The Slurm folks have been working on failure management support for a
little while. The idea is that a job can draw a replacement from a pool
of spare nodes or, alternatively, bargain with the scheduler for a node
that's currently busy to be added to the job once it comes free,
potentially extending the walltime to make up for the shortfall.

A better description from someone with higher caffeination is here:

http://slurm.schedmd.com/nonstop.html
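
For what it's worth, the two options described above (take an idle spare,
or wait for a busy node and extend the walltime) can be sketched as a toy
model in Python. This is only an illustration of the idea, not Slurm's
API: Job, SparePool, handle_failure, wait_min and busy_node are all
hypothetical names made up for this example.

```python
# Toy model only: none of these names come from Slurm or its API.
from dataclasses import dataclass, field


@dataclass
class Job:
    nodes: list          # nodes currently allocated to the job
    walltime_min: int    # time limit for the job, in minutes


@dataclass
class SparePool:
    idle: list = field(default_factory=list)  # spare nodes free right now

    def take(self):
        """Return an idle spare node, or None if the pool is empty."""
        return self.idle.pop(0) if self.idle else None


def handle_failure(job, failed, pool, wait_min=0, busy_node=None):
    """Swap out a failed node, preferring an idle spare.

    If no spare is idle, fall back to a busy node expected to come free
    in `wait_min` minutes and extend the walltime by that wait to make
    up for the shortfall. Returns the replacement node, or None if the
    job has to carry on with fewer nodes.
    """
    job.nodes.remove(failed)
    replacement = pool.take()
    if replacement is None and busy_node is not None:
        replacement = busy_node
        job.walltime_min += wait_min
    if replacement is not None:
        job.nodes.append(replacement)
    return replacement


job = Job(nodes=["n1", "n2", "n3"], walltime_min=120)
pool = SparePool(idle=["spare1"])

# First failure: an idle spare is available, walltime is unchanged.
handle_failure(job, "n2", pool)

# Second failure: pool is empty, so wait 30 minutes for busy node n9.
handle_failure(job, "n3", pool, wait_min=30, busy_node="n9")
```

In this sketch the walltime is only extended when the job had to wait for
a busy node. How the real scheduler negotiates that is covered in the
page linked above.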

All the best,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci


