[Beowulf] Newbie questions on cluster technology to use
kir at lapshin.net
Wed Jul 25 16:58:17 PDT 2007
This is my first post, so please accept my apologies if the questions
are too simple. I would appreciate pointers to documentation, writeups,
howtos, etc. I did some research before posting here, but got lost in
the sheer amount of information and competing technologies available.
We are planning to set up a cluster at my workplace to handle some
computation-heavy jobs we have, and the main task at the moment is to
choose the right technology.
First of all, let me try to describe the task we have at hand.
1. There are a lot of relatively short jobs submitted by users. There
are also much longer jobs submitted automatically at a known schedule.
2. Even though the jobs are short (they take minutes to complete on a
single machine), it is still important to parallelize each job so that
it runs even faster (on the order of tens of seconds). This is the
financial industry we are talking about, and time is money.
3. The jobs are quite easily parallelizable, probably embarrassingly so.
A simple master/slave pattern naturally applies here. We already have a
parallel implementation running on a single host, utilizing multiple
processors via threads. It would be nice to be able to do it over many
machines as well.
4. Jobs have to be scheduled properly, meaning that some users should
have higher priority than others, and especially higher than the
automated long-running jobs; if a user submits too many jobs, his
priority decreases; etc.
5. The implementation has to be fault tolerant, transparently surviving
individual machine failures. Transparently for users, that is; it is OK
to program tasks in a special way to get fault tolerance.
6. It would be nice to be able to submit "backup" tasks once a job nears
completion, just in case some nodes in the cluster are running slow.
E.g. if a job is split into 1000 tasks, runs on a 16-node cluster, and
is almost done, with just 4 tasks left to finish and a lot of idle nodes
in the cluster, the scheduler could submit each outstanding task to two
machines and pick up the result from whichever one completes first. If
the cluster is heterogeneous, or one node just runs slower, this could
speed up job completion considerably. At least that's what I've read in
Google's MapReduce paper.
7. Some cluster health monitoring is needed. It does not have to be
sophisticated, but at least we should be able to learn easily that some
host has died and needs repair. Statistics are nice to have as well
to be able to adjust user priorities, make decisions on buying new
8. The business is somewhat Windows-centric, though I would try to push
Linux as a platform. That is doable, provided the benefits are good. A
Linux port of the program is not a problem.
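To make point 6 more concrete, here is a minimal Python sketch of the
"backup task" idea. It is only an illustration under assumptions: a
local thread pool stands in for the cluster nodes, and the names
run_with_backups, worker and backup_threshold are made up for this
example and do not come from any scheduler mentioned here.

```python
import concurrent.futures as cf


def run_with_backups(tasks, worker, max_workers=16, backup_threshold=4):
    """Run worker(arg) for every task and duplicate stragglers near the end.

    tasks maps a task id to its argument. The result maps each task id to
    whichever copy of that task finished successfully first.
    """
    results = {}
    backed_up = set()
    pool = cf.ThreadPoolExecutor(max_workers=max_workers)
    try:
        # Future -> task id, for every copy that is still running or queued.
        pending = {pool.submit(worker, arg): tid for tid, arg in tasks.items()}
        while len(results) < len(tasks):
            done, _ = cf.wait(list(pending), return_when=cf.FIRST_COMPLETED)
            for fut in done:
                tid = pending.pop(fut)
                # Keep only the first successful result for each task.
                if tid not in results and fut.exception() is None:
                    results[tid] = fut.result()
            outstanding = set(tasks) - set(results)
            # Near the end of the job, submit one extra copy of each
            # unfinished task (this also retries a task that failed earlier).
            if len(outstanding) <= backup_threshold:
                for tid in outstanding - backed_up:
                    pending[pool.submit(worker, tasks[tid])] = tid
                    backed_up.add(tid)
            if outstanding and not pending:
                raise RuntimeError("tasks failed on every attempt: %r"
                                   % sorted(outstanding))
    finally:
        # Do not wait for the slow copies; cancel_futures needs Python 3.9+.
        pool.shutdown(wait=False, cancel_futures=True)
    return results
```

A real scheduler would also have to kill the losing copy on the remote
node; threads cannot be interrupted, so this sketch simply ignores its
result.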
Potential solutions I see:
1. TIBCO distributed queue. In short, it is a proprietary solution that
is more or less a fault-tolerant load balancer. The downsides are the
absence of any scheduler (it works as a FIFO) and the fact that it is
proprietary. We would much rather use open source technologies. See
below for a bit of info on TIBCO.
2. MPI with some scheduler (Condor?). From what I have read, it looks
like fault tolerance is not easy to achieve in the MPI world, and even
if it is possible, a failure of the master node will render the whole
cluster unusable. I could be wrong on this, and I hope I am.
3. Torque? Grid Engine? Globus? Something else?
What are your suggestions? We need to decide on a technology and try to
implement it, gaining more knowledge in the process and hopefully making
a more informed decision in version 2 of our cluster. Any input would be
Some details on TIBCO. TIBCO at heart is an enterprise messaging system,
which propagates information via broadcasts on the same subnet and can
route it from subnet to subnet via a special daemon. It was mainly
designed to integrate various systems in an enterprise via a common pipe
to which each system connects for data exchange, instead of building
many point-to-point connections between individual systems. On top of
this messaging technology they developed a distributed queue, which
works like this: you start many copies of the app on many machines; they
all find each other via broadcasts, elect a master among themselves, and
send heartbeat messages every now and then to monitor the health of the
nodes. Once a message arrives, the master chooses which worker should
process it. If one of the nodes dies, the master resubmits its task to
another node. If the master dies, the remaining nodes elect a new master
and keep going from there. This is possible since all communication is
done via broadcasts, and every node can maintain the master's state.
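The heartbeat and re-election behaviour described above can be sketched
in a few lines of Python. This is not TIBCO's API, just a single-process
illustration under assumptions: the class QueueMember and its method
names are hypothetical, the master is taken to be the live member with
the lowest id, and broadcasts are modelled by calling every member's
handler directly.

```python
class QueueMember:
    """One member's replicated view of the group (hypothetical sketch)."""

    def __init__(self, node_id, timeout=3.0):
        self.node_id = node_id
        self.timeout = timeout
        self.last_seen = {}    # node id -> time of its last heartbeat
        self.assignments = {}  # task id -> node currently working on it

    def on_heartbeat(self, sender, now):
        # Every member hears every broadcast, including its own.
        self.last_seen[sender] = now

    def on_assignment(self, task_id, worker):
        # Assignments are broadcast too, so each member can take over as master.
        self.assignments[task_id] = worker

    def alive(self, now):
        return sorted(n for n, t in self.last_seen.items()
                      if now - t <= self.timeout)

    def master(self, now):
        # Assumed election rule: the lowest id among live members wins.
        live = self.alive(now)
        return live[0] if live else None

    def reassign_orphans(self, now):
        """If this member is the master, move tasks off dead nodes."""
        if self.master(now) != self.node_id:
            return {}
        live = self.alive(now)
        moves = {}
        for i, (task_id, worker) in enumerate(sorted(self.assignments.items())):
            if worker not in live:
                moves[task_id] = live[i % len(live)]
        self.assignments.update(moves)
        return moves
```

Because every member keeps the same last_seen and assignments tables,
whichever member becomes master after a failure already knows which
tasks were orphaned, which is the property the description above relies
on.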