[Beowulf] Cluster install and admin approach (newbie question)

Tue Aug 25 17:11:36 PDT 2009

Greetings,

I am relatively new to cluster environments and I was given a small
(7 nodes + 1 head) cluster to admin. So far I have only had to maintain
what was already installed, so there were few problems to solve (or to
think about). But new machines have arrived (different: AMD Opteron vs
Intel Xeon) and I have to expand the cluster (think and solve
problems). The (old) cluster is semi-diskless (all machines do have
disks, but they boot from a single image on a central server), with NFS
for filesystem sharing. The main problems I had were:
 * if the /var filesystem is shared, race conditions happen (all nodes
want to write to the same files). I had this problem and moved to a
local /var filesystem.
 * if /var is local (which it can be, because the disks do exist), the
whole point of a central point for easy admin vanishes, because I would
have to create on each node all the /var structure that packages need
to work (it would be easier to do: "for node in $nodes; do ssh $node
$install_cmd; done" than to guess which dirs I need to create or which
files to copy).
 * if /var is tmpfs, all forensics are certainly gone after a failure
(Murphy told me this one ;).
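
To make the second point concrete, here is a minimal sketch of what
"creating the /var structure" on a node-local disk could look like. It
is only an illustration under assumptions: the function name
replicate_var_skeleton and both paths are hypothetical, and the /var
inside the shared boot image is assumed to be a usable reference tree.
Only directories are recreated; no files are copied, and ownership and
permissions are not preserved.

```shell
#!/bin/sh
# Hypothetical helper (not part of any existing tool): recreate the
# directory skeleton of a reference /var under a node-local target.
# $1 = reference tree (e.g. the /var inside the shared image; assumed)
# $2 = target directory on the node's local disk (assumed)
replicate_var_skeleton() {
    src=$1
    dst=$2
    # List every directory relative to $src, then recreate it under $dst.
    # The cd happens in a subshell, so the caller's working directory
    # is left unchanged.
    (cd "$src" && find . -type d) | while IFS= read -r dir; do
        mkdir -p "$dst/$dir"
    done
}
```

This covers only the directories; any state files that packages expect
to find under /var would still have to be identified and copied
separately, which is exactly the guessing described above.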

Everything I have read on the subject underlines the advantages of
diskless approaches but fails to warn about this problem or to solve
it. On the other hand, the tools for the distributed approach (where
every node is autonomous) seem to be stalled (such as SystemImager,
which is used in the OSCAR project), discontinued, or truly overblown
for my scale (IBM's xCAT); so it really seems that I'm missing
something.

The question is: what do you do about this?

Gil Brandao