[Beowulf] best architecture / tradeoffs

Mark Hahn hahn at physics.mcmaster.ca
Sat Aug 27 12:18:10 PDT 2005


> > 1) diskless vs disk
> > 
> > I am thinking diskless is better. I don't worry about network traffic as 
> > much as power consumption, overall node cost, reliability, and ease of 
> > management. My nodes are all identical, so I figure diskless, right?  
> > Well I am having a few problems...
> 
> Diskless vs disk isn't THAT simple.  Diskless nodes rely on the server
> which gives you a single point of failure.  

this is surprisingly inconsequential.  early in the life of one of my
clusters, the master would die routinely (hardware problem), and this
would not affect jobs running on the 96 diskless slave nodes.  irritating
for me, sure, but hardly a big deal.

> configured to standalone.  In fact, a diskfull node cluster doesn't need
> to have any particular "head node" so that any node can go down and you
> can work with the remaining nodes.  Also, as you're finding out, there

diskless doesn't require any particular head node either.  yes, they're 
somewhat dependent on some NFS server(s) for their filesystems, but then
again, that's normally the case for /home, etc.  in reality, it's not hard
to build a bomb-proof NFS server, so I consider this argument weak.
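
for concreteness, a shared read-only root is a one-line export on the
server; the path and subnet here are made up:

    /export/noderoot   10.0.0.0/255.255.255.0(ro,no_root_squash,async)

/home and the like would be separate, read-write exports.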

> is a bit more of a learning curve for diskless when things don't work
> perfectly the first time.

one very nice thing about diskless clusters is that they are easier to 
mess with, reinstall, change, update, etc.  the only downside is the TFTP
and NFS traffic involved in booting, but that's simply not a big deal.
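
to show how little machinery is involved: the netboot setup is a couple of
dhcpd.conf lines pointing nodes at a TFTP server, plus kernel arguments
telling them where their root lives.  addresses and paths below are
invented for illustration:

    # dhcpd.conf fragment
    subnet 10.0.0.0 netmask 255.255.255.0 {
        range 10.0.0.10 10.0.0.250;
        next-server 10.0.0.1;                 # the TFTP server
        filename "pxelinux.0";
    }

and in the pxelinux config, an append line along the lines of:

    append root=/dev/nfs nfsroot=10.0.0.1:/export/noderoot ip=dhcp ro

(assuming a kernel with nfsroot support compiled in, of course.)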

> > I still don't know exactly about swap. One of the clusters I set up was 
> > an NFS mounted root file system that did something with swap to 
> > /dev/loop0, but I don't really understand that, is the swap going onto 

swap across the network is asking for trouble.  you should evaluate whether
you actually need swap at all.  my preference is to put a disk in each node
just to handle swap, even though I'd rather *boot* as if diskless.
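
(a local swap disk is then just an ordinary fstab entry on the node;
the device name here is hypothetical:

    /dev/hda2   none   swap   sw   0 0

mkswap it once at install time and forget it exists.)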

> > the nfs drive or is it just back into memory? What is the best ( fastest 
> > ) way to handle swap on diskless nodes that might sometimes be 
> > processing jobs using more than the physical RAM?

you need to seriously rethink such jobs, since actually *using* swap is
pretty much a non-fatal error condition these days: the job will finish,
but orders of magnitude slower than it should.

> conditions.  Networked remote disk even more so, if you manage to work
> this out. 

actually, swap over the network *could* make excellent sense, since,
for instance, gigabit transfers a page about 200x faster than a disk
can seek.  (I'm assuming that the "swap server" has a lot of ram ;)
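
the arithmetic, with round numbers: gigabit moves ~100 MB/s of payload, so

    4 KB page / 100 MB/s  =  ~40 us per page
    average disk seek     =  ~8 ms
    8 ms / 40 us          =  200x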

> > Also, is it really true you need a separate copy of the root nfs drive 
> > for every node? I don't see why this is. I have it working with just one 

certainly not!  in fact, it's basically stupid to do that.  my diskless
clusters do not have any per-node shares, though per-node shares would
simplify certain things (/var, mainly).

> writing to the same file(s) in /var?  What about files in /etc that give
> a system some measure of identity (e.g. ssh keys).  What about

why would all the nodes in a cluster need separate host keys?

> applications that create e.g. file-based locks in /tmp that can just

but /tmp can quite sensibly be a tmpfs...
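
(that's a one-line fstab entry in the shared root; the size is arbitrary:

    tmpfs   /tmp   tmpfs   size=256m,mode=1777   0 0

and every node gets its own private, ram-backed /tmp.)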

> system just wrote.  So rolling your own single-exported-root cluster can
> work, or can appear to work, or can work for a while and then
> spectacularly fail, depending on just what you run on the nodes and how
> they are configured.

sorry Robert, but this is FUD.  a cluster of diskless nodes each mounting
a single shared root filesystem (readonly) is really quite nice, robust, etc.

> There are, however, ways around most of the problems, and there are at
> this point "canned" diskless cluster installs out there where you just
> install a few packages, run some utilities, and poof it builds you a
> chroot vnfs that exports to a cluster while managing identity issues for

canned is great if it does exactly what you want and you don't care to 
know what it's doing.  but the existence of canned systems does NOT mean
that it's hard!

> > 2) message passing vs roll yer own
> > 
> > I have played with a few different packages, and written a bunch of perl 
> > networking code, and read a bunch and I am still not sure what is 
> > better. Please chime in:

depends.  MPI is clearly aimed at number-crunching.  if that's not what
you're doing, then MPI is grotesquely bloated.  but if you roll your own,
think carefully about what sorts of infrastructural things you need to
make it work well and reliably.

on one of my clusters, I wrote a job-starter daemon in perl.  it listens 
on a socket, and when it receives "r 42", it does an SQL lookup, figures 
out how to run job #42, and fires it up.  this kind of dedicated daemon
is dramatically faster than ssh or rsh, and if anything, more secure.
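
the core of such a daemon fits in a screenful of perl.  here's a rough
sketch of the idea -- the port, DSN, credentials, and jobs table (id, cmd
columns) are all invented for illustration:

    #!/usr/bin/perl -w
    use strict;
    use IO::Socket::INET;
    use DBI;

    $SIG{CHLD} = 'IGNORE';                     # autoreap finished jobs
    my $dbh = DBI->connect("dbi:mysql:cluster;host=master",
                           "jobs", "secret", { RaiseError => 1 });
    my $srv = IO::Socket::INET->new(LocalPort => 4242, Listen => 5,
                                    ReuseAddr => 1, Proto => 'tcp')
        or die "listen: $!";

    while (my $c = $srv->accept) {
        my $req = <$c>;                        # protocol: "r <jobid>\n"
        close $c;
        defined $req or next;
        my ($id) = $req =~ /^r\s+(\d+)/;
        defined $id or next;
        my ($cmd) = $dbh->selectrow_array(
            "SELECT cmd FROM jobs WHERE id = ?", undef, $id);
        defined $cmd or next;
        my $pid = fork;
        defined $pid or die "fork: $!";
        if ($pid == 0) {                       # child: become the job
            exec "/bin/sh", "-c", $cmd;
            die "exec: $!";
        }
    }

the client side is then just "open socket, print, close" -- no forked
shell, no PAM, no crypto handshake on the critical path.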

a follow-on cluster does it differently: I figured that since each node 
was going to be running sshd already, I'd use that to start the jobs.
so I have a forced command in ~root/.ssh/authorized_keys that runs 
a perl script that retrieves the necessary stuff from SQL and starts the job.
it's pretty tidy, I think, but in reality, it doesn't matter much because 
most clusters don't run hundreds of jobs per second per node.
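
for the curious, the forced-command entry looks something like this (the
script path is made up; sshd hands the client's requested command to the
script in $SSH_ORIGINAL_COMMAND):

    command="/usr/local/sbin/startjob.pl",no-port-forwarding,no-pty ssh-rsa AAAA... master

so that key can do exactly one thing on the node: run startjob.pl.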

> >     - what is the fastest way to run perl on worker nodes. Remember I 
> > don't need to do anything too fancy, just grab a bunch of workers, send 
> > jobs to them, assemble the results, send the results to another worker, 
> > etc. I don't need to broadcast to all nodes or anything else.

I have a version of rshd in perl I'd be happy to provide.  it's a lot faster
than the real rshd (presumably by avoiding PAM, syslog, etc).

> it is one way.  In perl, after all, TMTOWTDI no matter what it is, says
> it right there in the perl book.

yes, quite nice.  that's why I wrote a whole cluster/queueing system in perl ;)

> > nodes running different perl programs, and assembling the data. This 
> > includes load balancing.

goodness, why do you want load-balancing?  I thought you said your processing
was compute-intensive?  (if it's compute-intensive, you want to dedicate a
cpu to each process.  it's really only for non-compute-bound, usually
interactive processes that you want real load-balancing.  otherwise, you
just want to manage a list of free cpus.)
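
managing a free-cpu list is itself only a dozen lines of perl.  a toy
sketch, with node names and job commands invented:

    #!/usr/bin/perl -w
    use strict;

    my @free = map { "node$_" } 1 .. 8;        # one entry per free cpu
    my @jobs = map { "./compute --chunk $_" } 1 .. 32;
    my %node_of;                               # pid -> the node it occupies

    while (@jobs or %node_of) {
        if (@jobs and @free) {                 # hand a job to a free cpu
            my $node = shift @free;
            my $cmd  = shift @jobs;
            my $pid  = fork;
            defined $pid or die "fork: $!";
            if ($pid == 0) { exec "ssh", $node, $cmd; die "exec: $!" }
            $node_of{$pid} = $node;
        } else {                               # block until some job ends
            my $pid = wait;
            push @free, delete $node_of{$pid} if $pid > 0;
        }
    }

no load averages, no balancing heuristics: a cpu is either running one
of your jobs or it isn't.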

> Seriously, if you want these features (especially in perl) you have to
> program them all in, and none of them are terribly easy to code.  The

I would have to say that the difficulty of this stuff is overblown.
getting a node to net-boot onto a RO nfs root is pretty trivial if you
know what you're doing, *or* if you look at a working example (LTSP, etc).
writing a daemon to field job requests is basically cut-and-paste.
combining that with a bit of SQL code is surprisingly simple.
every design decision has its drawbacks (including using SQL), but you just
have to try it to know whether it is worth it in the end.  for instance,
I chose SQL in part because it let me ignore the challenges of building a
reliable data store, let me scale extremely well, and ultimately all the
job info had to end up in a DB anyway...
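
to make that concrete, the jobs table needn't be anything fancy.  an
illustrative (invented) schema:

    CREATE TABLE jobs (
        id       INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        cmd      TEXT NOT NULL,
        node     VARCHAR(32),                  -- null until dispatched
        state    ENUM('queued','running','done','failed')
                     NOT NULL DEFAULT 'queued',
        started  DATETIME,
        finished DATETIME
    );

the DB buys you atomic state transitions plus a free audit trail of every
job ever run.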

> Reliability of ANY sort of networking task (run through perl or not) --
> well, there are whole books devoted to this sort of thing and it is Not
> Easy.  Not easy to tell if a node crashes, not easy to tell if a network
> connection is down or just slow, not easy to restart if it IS down,
> especially if the node drops the connection because of some quirk in its
> internal state that -- being remote -- you can't see.  Not Easy.

depends.  I typically make sure my clusters have pretty reliable nodes 
and networking - it just doesn't seem to make sense to patch around 
unreliability.  given that, I don't think this is hard at all.

> OTOH, if you're using perl as a job distribution mechanism, you have
> lots of room to improve without resorting to scyld, and you can always
> try bproc (which is a GPL thing that is at the heart of scyld, or used

actually, a stripped down rsh seems to be about the same speed as bproc.



