[Beowulf] Should I go for diskless or not?
landman at scalableinformatics.com
Fri May 15 07:00:14 PDT 2009
Douglas Eadline wrote:
> You will note that I used sufficient wiggle words "usually" and
> "generally" because in my experience it always depends.
> And of course my comments are from my personal
> experience. I have found that diskless allows for
> the entire cluster to be "re provisioned" without
> having to re-image disks. Reboots are quicker
> (for the hardware I use) and since I use ram
> disk approach (Warewulf/Perceus) I find that things
> are a bit faster, also the diskless image has
> minimal services running vs a disk-full distribution image.
> (I fully understand the good admin can trim a
> disk-full distribution)
> There are plenty of arguments either way. Back in 2006
> I did a mini-poll on node disk space usage:
> So "in general, it varies". YMMV
Let me weigh in (late of course) with some thoughts.
With the advent of buzz words^H^H^H^H^H^H^H^H^H^H^Htechnologies like the
cloud(TM)(c)(SM)*, I have a sense that stateful (disk based) installs
are a thing of the past. Some folks will want them for their local
cluster, but when you are provisioning N nodes, you aren't going to want
to deal with the vagaries of doing a stateful install to "random" hardware.
For non-virtualized hosts, what you want is to either boot diskless in a
trivial to configure manner, or, and we are seeing and working on this
more and more, boot from iSCSI targets with the node OS implemented.
The latter is a very nice way to do non-virtualized OS boots. You don't
need to be able to PXE boot from iSCSI, though this is an option.
The advantage of diskless is you don't need to containerize the OS in an
"image" (which is little more than a 'dd if=/dev/zero of=disk.image
bs=1M count=8k ; losetup /dev/loop0 disk.image ; mkfs... /dev/loop0 ;
mount /dev/loop0 /mnt/disk ; #copy tree to disk' )
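
For readers who do want such an image, the one-liner above can be spelled out roughly as follows. This is only a sketch under assumptions: the function names are made up for illustration, ext3 is a guess (the filesystem is left unspecified above), and the image is created sparse rather than zero-filled. Only the first function runs without root.

```shell
#!/bin/sh
# Create a sparse image file of SIZE_MIB mebibytes (no root needed).
# count=0 with seek= sets the file size without writing any blocks,
# unlike the count=8k form, which writes 8 GiB of zeros.
make_image() {
    img=$1
    size_mib=$2
    dd if=/dev/zero of="$img" bs=1M count=0 seek="$size_mib" 2>/dev/null
}

# Format the image and copy an OS tree into it. Needs root.
# ASSUMPTION: ext3 is a placeholder choice of filesystem.
populate_image() {
    img=$1
    tree=$2
    mnt=$3
    # Attach the image to the first free loop device and print its name.
    loopdev=$(losetup --find --show "$img") || return 1
    status=1
    if mkfs.ext3 -q "$loopdev" && mount "$loopdev" "$mnt"; then
        cp -a "$tree"/. "$mnt"/
        status=$?
        umount "$mnt" || status=1
    fi
    # Always detach the loop device, even if an earlier step failed.
    losetup -d "$loopdev"
    return "$status"
}

# Example (hypothetical paths, run as root):
#   make_image disk.image 8192
#   populate_image disk.image /srv/node-tree /mnt/disk
```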
The disadvantage of diskless is some complexity in the boot/shutdown
process. This has been (mostly) managed well by a number of
distributions, so it is generally (these days) fairly easy to deploy
diskless systems. Swap is an issue. A somewhat hard to solve issue ...
we'd recommend actually turning off swap (and swappiness in the kernel)
for diskless. Or put a USB drive in each machine and swap on that,
though, honestly, that is as reliable as swapping over the network.
I.e., don't do it.
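
A minimal sketch of what "turning off swap (and swappiness in the kernel)" might look like on a node. The function names are made up for illustration; `swapoff -a` and the `vm.swappiness` sysctl are standard Linux, but whether 0 is the right value for a given workload is a judgment call, not something stated above.

```shell
#!/bin/sh
# Count active swap areas listed in a /proc/swaps-style file.
# The first line of that file is a header, so it is skipped.
active_swap_count() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "${1:-/proc/swaps}"
}

# Turn off all swap areas and tell the kernel to avoid swapping.
# Needs root, so it is only defined here, not called.
disable_swap() {
    swapoff -a &&
    sysctl -w vm.swappiness=0
}

# Example (run as root on a diskless node):
#   disable_swap && [ "$(active_swap_count)" -eq 0 ] && echo "no swap active"
```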
The Perceus hybrid approach works well, using a small local ramdisk, and
an NFS mount for the important stuff. No installation needed, just boot
it, and up you go. Same issues on swap.
Another way to look at local drives is a small swap space. So if you
can get drives in $40/unit increments, these are cheap and fairly
reliable swap targets. Better than USB. But you shouldn't swap in a
The advantage of iSCSI is, it is a block device that you attach, over
the network, and it works pretty well. You can (easily)
programmatically adjust it, clone it, modify it, etc. Pull it down to a
different system (offline), make changes, upload the different image,
and try it on a few nodes. Basically iSCSI is sort of like diskless in
a way, though it allows greater ranges of changes without creating whole
new diskless trees, and without jumping through the host issues with
some diskless schemes. It's not a VM, but as with diskless, VMs can boot
from it, which makes development fast/efficient. It allows you a VM-able,
and easily transportable container for your stuff.
Of course, it goes without saying that for either diskless or iSCSI, you
need a really good and fast storage infrastructure ... :)
More to the point, the stateful installations have their place, but it
doesn't appear to be in "cloud" scenarios. Stateful is fine for local
clusters, or automagically assembled larger clusters. I would be
mindful of the latter though, as we have had a number of customers with
problems directly attributable to unresolved issues in some cluster
distribution's installer code. If your installer has bugs, or does
something wrong, somewhere (like, I dunno, disabling the IPMI?), you can
wind up with N unbootable nodes. As N gets north of 8, this gets less
Other issues we have run into with stateful clusters have included the
cluster distribution getting in the way of what you want to do. Most
cluster distributions come with an overarching philosophy, and the "one
true way" to do things. Invariably you will find a need to do something
"an other way", and can run head-first into the philosophy (and
occasionally ... um ... animated [yeah, that's the ticket] ...
discussions with the disciples of that philosophy) as you try to figure
out what you need to do. Some of these philosophies are benign, some
The stateful case with specific distributions (RHEL and alike) doesn't
do so well with new hardware. We have had customers call us up to
complain that their shiny new nodes seem to crash regularly with the
(ancient) kernels in RHEL. Sure enough, the new-thingamabob on the
motherboard is not well supported in RHEL, so you either have to a) toss
your cluster and only buy from the HAL, or b) get a newer kernel. It
gets more ... exciting ... and not in a good way, when that
new-thingamabob impedes installation. We've seen this happen ... too
many times to count.
The solution is to slip in a new kernel and initrd to fix the problem.
Slip-streaming a new kernel into a RHEL distribution is an ... um ...
challenge ... to say the least.
In diskless, this is trivial. In iSCSI, it is the same as diskless (in
both cases, the boot kernel and boot initrd are outside of the trees).
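
To illustrate why this is trivial: with the kernel and initrd living outside the node tree, switching kernels amounts to pointing the network boot config at different files. The sketch below assumes a PXELINUX-style boot loader (not named above); the function name, the label "node", and the file names are placeholders.

```shell
#!/bin/sh
# Print a PXELINUX boot stanza for the given kernel and initrd file names.
# Any further arguments are appended to the kernel command line.
# ASSUMPTION: PXELINUX syntax; the label name "node" is illustrative.
pxe_stanza() {
    kernel=$1
    initrd=$2
    shift 2
    printf 'DEFAULT node\n'
    printf 'LABEL node\n'
    printf '  KERNEL %s\n' "$kernel"
    printf '  APPEND initrd=%s' "$initrd"
    if [ $# -gt 0 ]; then
        printf ' %s' "$*"
    fi
    printf '\n'
}

# Example (hypothetical file names and TFTP path):
#   pxe_stanza vmlinuz-new initrd-new.img ip=dhcp \
#       > /tftpboot/pxelinux.cfg/default
```

Swapping in a newer kernel is then a matter of dropping the new files next to the old ones and regenerating the stanza; the node tree itself is untouched.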
For stateful installs where the customer doesn't care which cluster
distribution is used and lets us use our Tiburon system, we PXE boot
from our kernel (184.108.40.206 based) and associated initrd.
Finishing scripts fix up any new kernel bits later on. It's harder to do
this with some other cluster systems. Easy with Perceus/Warewulf.
Basically, where I am getting to is, if you really want to go stateful,
make sure your kernel works with your hardware before you make a
decision about distro, etc. Less important with diskless/iSCSI. If you
go stateless, solve the swap problem with small cheap local drives.
* we are drinking the koolaid, and starting work with a partner on this
stuff. I do see some merit in it (virtualized or direct), though it
isn't for everyone in HPC. The folks for whom it will work well have a
specific usage profile.
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615