Is there any work management tools like that.
becker at scyld.com
Tue Jul 30 09:01:54 PDT 2002
On Tue, 30 Jul 2002, William Thies wrote:
> We need such kind of work management tools working on
> a 32-node cluster.
> 1. We will always run a very large master-slave
> program on this cluster.
> 2. Sometimes, we need to use this cluster to do other
Most any scheduling system can handle this kind of job allocation, at
least for new jobs.
The devil is in the details. For the large job workload, is that job a
number of short-lived independent processes, or a single
job with many long-lived communicating processes?
> (1) We want to power off 8 nodes first,
Why power off? You can use WOL or IPMI, but that power-cycle will take
on the order of minutes -- far longer than scheduling, and significantly
longer than other approaches to clearing the machine state. The Scyld
system can clear the machine state in just a few seconds.
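For reference, the wake-up half of a WOL power cycle is a single broadcast UDP packet. The sketch below builds the standard magic packet (six 0xFF bytes followed by the target MAC address repeated 16 times). The function names, the broadcast address and the UDP port 9 default are assumptions chosen for illustration, and it does nothing about the minutes of boot time discussed above.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Return the Wake-on-LAN magic packet for a MAC such as '00:11:22:33:44:55'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"expected 12 hex digits in MAC, got {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)
    # Six 0xFF bytes, then the 6-byte MAC repeated 16 times (102 bytes total).
    return b"\xff" * 6 + mac_bytes * 16


def send_wol(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet; address and port are illustrative defaults."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))
```

The target NIC and BIOS must have WOL enabled for this to have any effect.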
> And at that time we don't want the GA program to use those 8 nodes
Every scheduling system can prevent job #1 from allocating new
processes on the reserved nodes. The question is, what happens to
the processes of job #1?
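The easy half, keeping new allocations off the reserved nodes, amounts to filtering the free list. This is a toy sketch rather than any real scheduler's API; `allocate_nodes` and the node numbering are hypothetical.

```python
def allocate_nodes(free_nodes, reserved, count):
    """Pick `count` nodes for a new process group, skipping reserved nodes.

    Hypothetical helper: a real scheduler would also track load,
    ownership and queue priority.
    """
    candidates = [n for n in free_nodes if n not in reserved]
    if len(candidates) < count:
        raise RuntimeError(
            f"need {count} nodes, only {len(candidates)} unreserved nodes free"
        )
    return candidates[:count]


# Example: a 32-node cluster with the last 8 nodes reserved for the second job.
all_nodes = list(range(32))
reserved_nodes = set(range(24, 32))
first_four = allocate_nodes(all_nodes, reserved_nodes, 4)
```

Note that this says nothing about processes already running on the reserved nodes, which is the harder question.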
Are they short-lived enough that they will terminate naturally in a
Can the slave processes just be suspended?
Do you expect the system to check-point and restart them later?
(If so, what about the non-check-pointed processes they are
Do you expect the system to migrate them to another node?
(Again, what are your communication expectations?)
Can the processes be signalled to check-point or migrate themselves?
(Scyld Beowulf provides tools to make this very easy, but it's not
a common feature on other scheduling systems.)
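To make the suspend and signalled check-point options concrete, here is a minimal sketch using only plain POSIX signals on a single node. It is not the Scyld tooling mentioned above. The class name, the choice of SIGUSR1 and the JSON check-point format are all assumptions for illustration, and it ignores the communication state between master and slaves.

```python
import json
import os
import signal


def suspend(pid: int) -> None:
    """Freeze a process in place; it keeps its memory but uses no CPU."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid: int) -> None:
    """Let a previously suspended process continue."""
    os.kill(pid, signal.SIGCONT)


class CheckpointingWorker:
    """Hypothetical slave process that check-points itself when signalled."""

    def __init__(self, path: str, step: int = 0):
        self.path = path
        self.step = step
        self.checkpoint_requested = False

    def install_handler(self) -> None:
        # Must be called from the main thread of the worker process.
        signal.signal(signal.SIGUSR1, self.on_signal)

    def on_signal(self, signum, frame) -> None:
        # Only record the request; the save happens at a safe point in run().
        self.checkpoint_requested = True

    def save(self) -> None:
        # Write to a temporary file and rename, so a crash never leaves
        # a half-written check-point behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"step": self.step}, f)
        os.replace(tmp, self.path)

    def run(self, total_steps: int) -> None:
        while self.step < total_steps:
            self.step += 1  # stand-in for one unit of real work
            if self.checkpoint_requested:
                self.save()
                self.checkpoint_requested = False

    @classmethod
    def restore(cls, path: str) -> "CheckpointingWorker":
        with open(path) as f:
            state = json.load(f)
        return cls(path, step=state["step"])
```

A scheduler would send SIGUSR1 to each worker and wait for the check-point file before reclaiming the node. Suspending with SIGSTOP is simpler, but the process still holds its memory on that node.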
> 3. This should be a multi-user management tool.
> Would you like to recommend some tools like that?
> Thanks very much!
We provide queuing, scheduling and node allocation systems(*) that can
accomplish this within a cluster. If you need site-wide scheduling
(multiple OSes, a mix of cluster and independent nodes, crossing
firewall boundaries, etc) you should look at PBSPro, LSF, and SGE.
Donald Becker becker at scyld.com
Scyld Computing Corporation http://www.scyld.com
410 Severn Ave. Suite 210 Second Generation Beowulf Clusters
Annapolis MD 21403 410-990-9993