[Beowulf] scheduler recommendations for a HPC cluster

Chris Samuel csamuel at vpac.org
Thu Oct 8 17:33:03 PDT 2009


----- "Rahul Nabar" <rpnabar at gmail.com> wrote:

> One downside of the Torque+Maui option is that it
> is not monolithic.

Actually from our point of view the really good part
of Torque is that the scheduler is pluggable and you
can have the very simple pbs_sched, Maui or Moab or
even write your own if you want using the examples!

> Oftentimes it is hard to know which component to blame
> for a problem or, more relevantly, which config file to
> change to fix it: Torque's or Maui's.

We try and keep Torque *really* simple (just some
queues to let a couple of applications select a
walltime) and do all the smarts in Maui/Moab.

For what we do we have to use Moab; Maui didn't
have some of the capabilities we needed.

One thing we *really* like is that Torque's pbs_mom
can run a health check script.  If the script reports
an error (say "ERROR /tmp full"), the message is passed
back to pbs_server and Moab marks the node as down
until the error clears.

This keeps a node with problems from taking jobs,
which means you can get to work on it sooner.  Ours
checks everything from SMART errors, MCEs and disk
space through to whether the node needs rebooting
for a kernel upgrade.
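
To give a feel for what such a script looks like, here is a
minimal sketch in Python that covers only the "/tmp full" case.
It is not our production script.  The 95% threshold and the
function names are made up for illustration, and it assumes
pbs_mom has been pointed at the script (as far as I recall, via
the $node_check_script and $node_check_interval lines in the
mom's config file, so check the docs for your Torque version).
The only contract it relies on is that a line starting with
"ERROR" on stdout is what gets reported.

```python
#!/usr/bin/env python3
"""Minimal node health check sketch: report a nearly full /tmp."""
import shutil

# Hypothetical threshold chosen for illustration only.
MAX_USED_FRACTION = 0.95


def check_disk(path, max_used_fraction=MAX_USED_FRACTION,
               usage=shutil.disk_usage):
    """Return an "ERROR ..." string if `path` is too full, else None.

    `usage` is injectable so the check can be exercised without
    actually filling a filesystem.
    """
    try:
        total, used, _free = usage(path)
    except OSError:
        # A filesystem we cannot inspect is itself worth reporting.
        return f"ERROR cannot check {path}"
    if total > 0 and used / total >= max_used_fraction:
        return f"ERROR {path} full"
    return None


def main():
    problem = check_disk("/tmp")
    if problem:
        # A single line beginning with ERROR is the message that
        # is meant to be passed back to the server.
        print(problem)


if __name__ == "__main__":
    main()
```

A real script would chain more checks (SMART, MCEs, pending
reboot) and print the first failure it finds; staying silent
means the node is considered healthy.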

If you're not using Moab then you can instead simply
get your health check script to run pbsnodes to mark
the node offline (remembering to use the -N option to
set an appropriate message).
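
As a rough sketch of that approach, the helper below builds and
runs a "pbsnodes -o -N <message> <node>" command.  The function
names are hypothetical, and it assumes pbsnodes is on the PATH
and that the account running the check is permitted to change
node state on the server.

```python
import subprocess


def offline_command(node, message):
    """Build the pbsnodes command that offlines `node` with a note."""
    # -o marks the node offline, -N attaches the explanatory note.
    return ["pbsnodes", "-o", "-N", message, node]


def mark_offline(node, message, run=subprocess.run):
    """Run pbsnodes and return True if it exited successfully.

    `run` is injectable so the helper can be tried out on a
    machine that has no Torque installation.
    """
    result = run(offline_command(node, message), check=False)
    return result.returncode == 0
```

A health check would call something like
mark_offline(socket.gethostname(), "ERROR /tmp full") when it
finds a problem, assuming the hostname matches the node name
known to pbs_server.  Someone then has to clear the offline
state (pbsnodes -c) once the node is fixed.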

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


