[Beowulf] best architecture / tradeoffs

Sat Aug 27 09:48:27 PDT 2005

On 8/26/05, Seth Keith <skeith at deterministicnetworks.com> wrote:
> 2) message passing vs roll yer own
> 
> I have played with a few different packages, and written a bunch of perl
> networking code, and read a bunch and I am still not sure what is
> better. Please chime in:
> 
>     - what is the fastest way to run perl on worker nodes. Remember I
> don't need to do anything too fancy, just grab a bunch of workers, send
> jobs to them, assemble the results, send the results to another worker,
> etc. I don't need to broadcast to all nodes or anything else.
> 
>     - what is the easiest way to do it. I wrote the whole thing in perl
> already, and I was not really impressed with the speed or reliability.
> Certainly this was at least partially programmer error, but my question
> stands, what is the easiest way to reliably control a cluster of worker
> nodes running different perl programs, and assembling the data. This
> includes load balancing.

What is the typical runtime of each job? If it is 10 minutes or
more, it may be worth using a batch system to get fault tolerance,
load balancing, job dependencies, and job accounting.

You can use Grid Engine (SGE) or Torque: both are open source. SGE is
better suited to compute-farm clusters, and it has a more active
user mailing list. But if you already know Torque/PBS, then just use
Torque.

With SGE, you can define job dependencies yourself:
> qsub worker
Your job 10 ("worker") has been submitted.
> qsub worker
Your job 11 ("worker") has been submitted.
> qsub worker
Your job 12 ("worker") has been submitted.
> qsub worker
Your job 13 ("worker") has been submitted.
> qsub -hold_jid 10,11,12,13 final
Your job 14 ("final") has been submitted.

So the final pass will not start until worker jobs 10-13 have all finished.
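
If you would rather script the submission than type job IDs by hand, here
is a minimal Python sketch. It assumes qsub prints the same confirmation
line as in the session above and that -hold_jid takes a comma-separated
list of job IDs. The helper names (parse_job_id, hold_command) are made up
for illustration, and nothing here actually calls qsub; it only parses the
output and builds the command line.

```python
import re

# Matches the confirmation line qsub prints in the session above, e.g.
#   Your job 10 ("worker") has been submitted.
SUBMIT_RE = re.compile(r'Your job (\d+) \("([^"]*)"\) has been submitted')


def parse_job_id(qsub_output):
    """Return the numeric job ID found in a qsub confirmation line."""
    match = SUBMIT_RE.search(qsub_output)
    if match is None:
        raise ValueError("unrecognized qsub output: %r" % (qsub_output,))
    return int(match.group(1))


def hold_command(job_ids, script):
    """Build the argument list for a job that must wait for job_ids."""
    id_list = ",".join(str(job_id) for job_id in job_ids)
    return ["qsub", "-hold_jid", id_list, script]


# Demo using the confirmation lines from the session above.
confirmations = [
    'Your job 10 ("worker") has been submitted.',
    'Your job 11 ("worker") has been submitted.',
    'Your job 12 ("worker") has been submitted.',
    'Your job 13 ("worker") has been submitted.',
]
worker_ids = [parse_job_id(line) for line in confirmations]
print(" ".join(hold_command(worker_ids, "final")))
```

In a real script you would get each confirmation line by running qsub
(for example through subprocess) and then run the list returned by
hold_command the same way.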

A batch system is better than MPI for this kind of task distribution
because each job is independent: while job 14 is waiting for job 13
to finish, jobs 15, 16, ... can start on the nodes freed by jobs 11 and 12.
Also, if a node fails, the batch system can rerun just the job that was
on that node, rather than rerunning the whole thing.

SGE: http://gridengine.sunsource.net
Torque: http://www.supercluster.org/projects/torque/

*However*, if each job takes less than a few minutes, the overhead of
using a batch system can kill you :(
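
To put rough numbers on that, here is a small Python sketch. The
15-second per-job overhead is an invented figure purely for illustration
(scheduling interval, job startup, and cleanup all vary by site), so
measure it on your own cluster before drawing conclusions.

```python
def useful_fraction(runtime_s, overhead_s):
    """Fraction of a job's wall-clock time spent on real work."""
    return runtime_s / (runtime_s + overhead_s)


# ASSUMPTION: hypothetical per-job batch overhead in seconds, not a
# measured SGE or Torque number.
OVERHEAD_S = 15.0

for runtime_s in (10, 60, 600):
    fraction = useful_fraction(runtime_s, OVERHEAD_S)
    print("%4d s job: %.0f%% useful work" % (runtime_s, fraction * 100))
```

Under that assumption a 10-second job spends well under half of its time
on real work, while a 10-minute job loses only a few percent. If your
jobs are short, one common workaround is to bundle many of them into a
single batch job.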

Another thing: if you want the ease of management of diskless nodes
but without the server as a single point of failure, as RGB pointed
out, you can take a look at cluster toolkits like the Rocks cluster
distribution:

http://www.rocksclusters.org/Rocks/

You can download "rolls" (packages) and install them on the head node,
and the rest will be handled by Rocks. There are already SGE and PBS
rolls, and maybe a Globus one.

Rayson

> 
>     - I saw some information on clusters that were linked in the kernel
> and acted as a single machine. Is this a working reality? How does the
> performance of such a system compare with message passing for dedicated
> processing such as my own.
> 
>     - I was playing with MPICH-2, is this better than LAM? What about
> other message passing libraries what is the best one?  any with direct
> hooks into perl?
> 
> 
>     - how fast is NFS and RSH. If I were to change the code so it works
> with a NFS mounted file instead of STDIN/OUT and I use RSH to
> communicate how would the speed compare with message passing? with
> direct perl networking?
> 
> 
> 3) Distribution and kernel
> 
> I create my NFS system by copying directories off my RH9 distribution. I
> had lots of problems and could never get everything working. I think it
> would be loads easier if I could find a standard distribution image
> already constructed somewhere out there... I don't really care what
> distribution as long as I can run perl.
> 
> I keep seeing people advising against the NFS root option and advocating
> ram disk images. Opinions here? Where can I get ram disk images? I would
> be nice to download a basic complete ram disk image, that boots with
> root rsh working already.
> 
> 
> 
> Well I guess that is enough for one day. Thank you for taking the time
> to read this email. If you have the time please send me your opinions on
> this stuff.
> 
> Thanks again.
> 
> -Seth