[Beowulf] Request for comments: diskless cluster
smulcahy at aplpi.com
Tue Dec 18 08:41:18 PST 2007
Jorge Salamero Sanz wrote:
> Hi all,
> I'm going to move a 42-nodes beowulf to diskless mode (currently all local
> cloned installations).
> Which system / tools do you recommend to manage the client-images ?
> I was thinking of a debootstrapped dir shared as NFS root. The differences
> between the nodes (/etc/hostname, /etc/fstab, /etc/exportfs ...) could be
> managed with unionfs.
> Debian has a couple of tools that could help (live-helper for making custom
> images) but maybe lessdisk would be more suitable. Which one do you use ?
> How do you manage this kind of cluster setup ?
We built a system like this for a customer a few years ago and it has
performed very well. We have a head-node which acts as a management node
and an NFS server for the diskless workstations in the cluster.
We used debootstrap to build images for the diskless nodes. We opted to
keep a separate disk image for each diskless node in order to keep things
simple. I'm sure you could do the same with unionfs or similar, but
disk space is cheap and the effort to put together something with unionfs
didn't seem to be justified at the time.
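
With one image directory per node, the NFS server needs one export per
directory. A minimal sketch in Python of generating those /etc/exports
lines follows. The /srv/nfsroot path, the addresses and the function name
are assumptions for illustration and are not taken from the setup
described here; the export options shown are standard Linux NFS server
options commonly used for an NFS root.

```python
def exports_line(image_dir, client_ip):
    """Return one /etc/exports line granting a single node its own root tree."""
    # no_root_squash lets root on the node act as root on its own image,
    # which a root filesystem needs; the export is limited to that one client.
    options = "rw,sync,no_root_squash,no_subtree_check"
    return "{} {}({})".format(image_dir, client_ip, options)


def exports_for(nodes, images_root="/srv/nfsroot"):
    """nodes maps hostname -> IP address (both hypothetical here)."""
    lines = [exports_line("{}/{}".format(images_root, name), ip)
             for name, ip in sorted(nodes.items())]
    return "\n".join(lines) + "\n"
```

For example, exports_for({"node07": "192.168.1.107"}) yields a single line
exporting /srv/nfsroot/node07 to that address only.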
To add a new node, we simply copy the debootstrapped directory contents
and change the hostname.
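
The copy-and-rename step can be sketched in Python roughly as below. The
directory layout and the add_node name are hypothetical, not the scripts
actually used on this cluster.

```python
import shutil
from pathlib import Path


def add_node(template_dir, images_root, hostname):
    """Copy a debootstrapped template tree and set the new node's hostname."""
    target = Path(images_root) / hostname
    # symlinks=True copies symlinks as links instead of following them.
    # copytree raises an error if the target already exists, so an
    # existing node image is never overwritten.
    shutil.copytree(template_dir, target, symlinks=True)
    etc = target / "etc"
    etc.mkdir(exist_ok=True)
    (etc / "hostname").write_text(hostname + "\n")
    return target
```

Note that shutil.copytree does not preserve file ownership and cannot copy
device nodes, so for a real root filesystem something like cp -a or rsync
run as root is the more realistic way to do the copy; the sketch only
shows the shape of the step.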
Each diskless node boots via PXE and runs a monolithic kernel compiled
with just the basics needed for the compute nodes. We did some
experiments with initrd images and modular kernels, but there were some
issues with Debian which caused us problems (see
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=388761). This may have
been fixed since we last looked at it but, again, the effort to fix it
wasn't justified given that we had a working system.
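
To illustrate booting a kernel straight onto an NFS root without an
initrd, here is a sketch in Python that builds a per-node boot config. It
assumes PXELINUX as the boot loader, which the description above does not
state, and the kernel name, server address and paths are made up. The
kernel parameters themselves (root=/dev/nfs, nfsroot=, ip=dhcp) are the
standard Linux ones, and they only work without an initrd if NFS root
support (CONFIG_ROOT_NFS), kernel IP autoconfiguration (CONFIG_IP_PNP) and
the network driver are compiled into the kernel.

```python
def pxelinux_filename(mac):
    """Name of the per-node file PXELINUX looks for under pxelinux.cfg/.

    The name is "01-" (the Ethernet hardware type) followed by the MAC
    address in lowercase hex with dashes.
    """
    return "01-" + mac.lower().replace(":", "-")


def pxelinux_entry(kernel, nfs_server, root_path):
    """Build a config that boots one kernel onto an NFS root, with no initrd."""
    append = "root=/dev/nfs nfsroot={}:{} ip=dhcp rw".format(nfs_server, root_path)
    lines = [
        "DEFAULT node",
        "LABEL node",
        "  KERNEL " + kernel,
        "  APPEND " + append,
    ]
    return "\n".join(lines) + "\n"
```

With separate images per node, each node gets its own file whose nfsroot=
path points at that node's directory on the head node.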
You may need to use a separate network for PXE booting your nodes - we
experienced problems using a gigabit ethernet network for PXE booting
where nodes randomly failed to get a response from the DHCP server. I
suspect there was a bug in the PXE firmware which occasionally caused it
to fail while the network cards were negotiating gigabit speed (but I
have no evidence to back this up) - moving PXE booting to a separate
fast ethernet network resolved the problem.
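
If the PXE network is separate, the DHCP server on it only needs to know
the MAC address of each node's boot interface. As a sketch, assuming ISC
dhcpd and PXELINUX (neither is named above) and using made-up addresses,
a per-node host entry could be generated in Python like this:

```python
def dhcpd_host_entry(hostname, mac, ip, tftp_server, bootfile="pxelinux.0"):
    """Return an ISC dhcpd host stanza for one PXE-booting node."""
    lines = [
        "host {} {{".format(hostname),
        "  hardware ethernet {};".format(mac.lower()),
        "  fixed-address {};".format(ip),
        # next-server is the TFTP server the PXE firmware fetches from.
        "  next-server {};".format(tftp_server),
        '  filename "{}";'.format(bootfile),
        '  option host-name "{}";'.format(hostname),
        "}",
    ]
    return "\n".join(lines) + "\n"
```

The MAC address here must be that of the interface attached to the PXE
network, not the gigabit interface used for cluster traffic.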
I'm not familiar with live-helper or lessdisk, perhaps I need to do more
Hope this info is of some use, I've probably only covered some random
aspects of our config that spring to mind ...
Stephen Mulcahy, Applepie Solutions Ltd., Innovation in Business Center,
GMIT, Dublin Rd, Galway, Ireland. +353.91.751262 http://www.aplpi.com
Registered in Ireland, no. 289353 (5 Woodlands Avenue, Renmore, Galway)