Uses for a beowulf cluster?

Robert G. Brown rgb@phy.duke.edu
Mon, 14 Sep 1998 12:56:06 -0400


On Mon, 14 Sep 1998, Robert G. Brown wrote:

On Mon, 14 Sep 1998, Jeffrey Moyer wrote:

> 	calls.  IMHO, this would be an acceptable solution.  Any other
> 	ideas?

It is my understanding that this is why the Condor project exists:

http://www.cs.wisc.edu/condor/

Following this reply is an excerpt from its "Overview" section.

BTW, I agree with Greg for the most part on the ongoing discussion.  If
at all possible, either a simple load balancing script or even just a
warning to all users to check and balance loads or expect delays is the
desirable solution (laissez faire is not a bad optimizer, actually,
except that it tends to punish the weak and incompetent :-).

However, if you are managing a public resource and thereby have lots of
very inhomogeneous users, or are sharing a distributed resource with
complicated rules as to what can run where and when and by whom, a
simple whomp to the side of the head with a manual isn't a convenient
method of behavioral correction, and your script becomes too complex with
conditionals (don't run on Joe's machine during the day, do run on
Sammy's machine anytime, don't run on Sally's machine when the load
average is above five...).  At that point Condor or the like is worth
the effort to install and configure, since I think this is pretty much
what it does and a lot of wheels have gone into it that YOU won't have
to reinvent.
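
To make the "too complex with conditionals" point concrete, here is a
minimal sketch of what such a homegrown policy check might look like.
The machine names, the 8:00-18:00 definition of "day", and the function
name may_run are all assumptions made up for illustration; nothing here
comes from Condor or from any real script.

```python
# Hypothetical per-machine policy, following the three example rules above.
DAY_START, DAY_END = 8, 18  # assumed working hours (24h clock)


def may_run(machine: str, hour: int, load_avg: float) -> bool:
    """Return True if a job may be started on `machine` right now."""
    if machine == "joe":
        # Don't run on Joe's machine during the day.
        return not (DAY_START <= hour < DAY_END)
    if machine == "sammy":
        # Do run on Sammy's machine anytime.
        return True
    if machine == "sally":
        # Don't run on Sally's machine when the load average is above five.
        return load_avg <= 5.0
    # Unknown machine: be conservative and refuse.
    return False
```

Every new machine or exception adds another branch, which is exactly
the point where a general-purpose tool starts to pay for itself.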

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb@phy.duke.edu

%< Snip Snip==================================================

What is Condor?

Condor is a software system that runs on a cluster of workstations to
harness wasted CPU cycles. A Condor pool consists of any number of
machines, of possibly different architectures and operating systems,
that are connected by a network. To monitor the status of the individual
computers in the cluster, certain Condor programs called the Condor
"daemons" must run all the time. One daemon is called the "master". Its
only job is to make sure that the rest of the Condor daemons are
running. If any daemon dies, the master restarts it. If a daemon
continues to die, the master sends mail to a Condor administrator and
stops trying to start it. Two other daemons run on every machine in the
pool, the "startd" and the "schedd". The schedd keeps track of all the
jobs that have been submitted on a given machine. The startd monitors
information about the machine that is used to decide if it is available
to run a Condor job, such as keyboard and mouse activity, and the load
on the CPU. Since Condor only uses idle machines to compute jobs, the
startd also notices when a user returns to a machine that is currently
running a Condor job and removes the job.
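
The startd behaviour described above can be sketched roughly as
follows. This is not Condor's code: the MachineState fields, the
15-minute idle threshold, the 0.3 load threshold, and the returned
labels are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

IDLE_SECONDS = 15 * 60  # assumed: no keyboard/mouse input for 15 minutes
MAX_OWNER_LOAD = 0.3    # assumed: load threshold, excluding the Condor job


@dataclass
class MachineState:
    seconds_since_input: float  # time since last keyboard or mouse activity
    owner_load: float           # CPU load not caused by a Condor job
    running_job: bool           # is a Condor job currently running here?


def startd_decision(state: MachineState) -> str:
    """Decide what to do with this machine based on owner activity."""
    owner_active = (
        state.seconds_since_input < IDLE_SECONDS
        or state.owner_load > MAX_OWNER_LOAD
    )
    if state.running_job:
        # The owner came back: remove the job; otherwise let it continue.
        return "evict" if owner_active else "keep"
    # No job here: advertise the machine only if it looks idle.
    return "unavailable" if owner_active else "available"
```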

One machine, the "central manager" (CM), keeps track of all the resources
and jobs in the pool. All of the schedds and startds of the entire pool
report their information to a daemon running on the CM called the
"collector". The collector maintains a global view, and can be queried
for information about the status of the pool. Another daemon on the CM,
the "negotiator", periodically takes information from the collector to
find idle machines and match them with waiting jobs. This process is
called a "negotiation cycle" and usually happens every five minutes.
(See figure 1).
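
As a rough illustration of a single negotiation cycle, the sketch below
pairs waiting jobs with idle machines in first-come order. The function
name and the first-come pairing are assumptions; the excerpt does not
say how the negotiator actually chooses which job goes to which
machine.

```python
def negotiation_cycle(idle_machines, waiting_jobs):
    """Pair waiting jobs with idle machines.

    Returns (matches, unmatched_jobs), where matches is a list of
    (job, machine) tuples. Jobs left over wait for the next cycle.
    """
    free = list(idle_machines)  # copy so the caller's list is untouched
    matches, unmatched = [], []
    for job in waiting_jobs:
        if free:
            matches.append((job, free.pop(0)))
        else:
            unmatched.append(job)
    return matches, unmatched
```

For example, with two idle machines and three waiting jobs, the first
two jobs are matched and the third stays queued until a later cycle.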

 .... (See website for the rest)