[Beowulf] Should I go for diskless or not?

Fri May 15 07:00:14 PDT 2009

Douglas Eadline wrote:
> You will note that I used sufficient wiggle words "usually" and
> "generally" because in my experience it always depends.
> And of course my comments are from my personal
> experience. I have found that diskless allows for
> the entire cluster to be "re provisioned" without
> having to re-image disks. Reboots are quicker
> (for the hardware I use) and since I use ram
> disk approach (Warewulf/Perceus) I find that things
> are a bit faster, also the diskless image has
> minimal services running vs a disk-full distribution image.
> (I fully understand the good admin can trim a
> disk-full distribution)
> 
> There are plenty of arguments either way.  Back in 2006
> I did a mini-poll on node disk space usage:
> 
> http://www.clustermonkey.net//component/option,com_poll/task,results/id,18/
> 
> So "in general, it varies". YMMV

Let me weigh in (late of course) with some thoughts.

With the advent of buzz words^H^H^H^H^H^H^H^H^H^H^Htechnologies like the 
cloud(TM)(c)(SM)*, I have a sense that stateful (disk based) installs 
are a thing of the past.  Some folks will want them for their local 
cluster, but when you are provisioning N nodes, you aren't going to want 
to deal with the vagaries of doing a stateful install to "random" hardware.

For non-virtualized hosts, you want either to boot diskless in a
trivial-to-configure manner or (and we are seeing and working on this
more and more) to boot from iSCSI targets that hold the node OS.
The latter is a very nice way to do non-virtualized OS boots.  You don't
need to be able to PXE boot from iSCSI, though this is an option.

The advantage of diskless is that you don't need to containerize the OS
in an "image" (which is little more than a 'dd if=/dev/zero
of=disk.image bs=1M count=8k ; losetup /dev/loop0 disk.image ; mkfs...
/dev/loop0 ; mount /dev/loop0 /mnt/disk ; #copy tree to disk' )
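For reference, here is a rough sketch of that image-building sequence as
a small Python helper.  It only builds and prints the commands; it does
not run them (they need root).  The function names are made up for
illustration, and the filesystem is left as a parameter because the
one-liner above does not name one; "mkfs.ext4" in the usage is an
arbitrary pick.

```python
import shlex

MIB = 1024 * 1024


def image_size_bytes(block_size_mib, count):
    """Size of the file dd writes for bs=<block_size_mib>M count=<count>."""
    return block_size_mib * MIB * count


def image_commands(image, loop_dev, mount_point, mkfs_cmd, count=8192):
    """Return the steps, in order, as argv lists.  Nothing is executed."""
    return [
        ["dd", "if=/dev/zero", f"of={image}", "bs=1M", f"count={count}"],
        ["losetup", loop_dev, image],      # bind the file to a loop device
        [mkfs_cmd, loop_dev],              # filesystem choice is the caller's
        ["mount", loop_dev, mount_point],  # after this, copy the OS tree in
    ]


if __name__ == "__main__":
    # count=8k means 8192 blocks; at bs=1M that is an 8 GiB image.
    print(image_size_bytes(1, 8192))
    # "mkfs.ext4" is only an example; the original leaves the mkfs open.
    for argv in image_commands("disk.image", "/dev/loop0",
                               "/mnt/disk", "mkfs.ext4"):
        print(shlex.join(argv))
```

Keeping the steps as data makes it easy to review them, or to feed them
to subprocess.run() once you are happy with them.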

The disadvantage of diskless is some complexity in the boot/shutdown 
process.  This has been (mostly) managed well by a number of 
distributions, so it is generally (these days) fairly easy to deploy 
diskless systems.  Swap is an issue, and a somewhat hard one to solve ...
we'd recommend actually turning off swap (and turning swappiness down in
the kernel) for diskless.  Or put a USB drive in each machine and swap on
that, though, honestly, that is about as reliable as swapping over the
network.  I.e., don't do it.
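A minimal sketch of checking for, and turning off, swap on a node.  The
helper names are invented for illustration, and setting vm.swappiness to
0 is my assumption about what "turning it off" should mean here; pick
the value that suits your kernel.  The commands are printed, not run.

```python
def active_swap_devices(proc_swaps_text):
    """Return the device/file names listed in /proc/swaps-style text."""
    lines = proc_swaps_text.strip().splitlines()
    # The first line is the column header (Filename Type Size Used Priority).
    return [line.split()[0] for line in lines[1:] if line.strip()]


def disable_swap_commands():
    """Commands that turn swap off and minimize swappiness (run as root)."""
    return [
        ["swapoff", "-a"],
        ["sysctl", "-w", "vm.swappiness=0"],  # assumed value, see above
    ]


if __name__ == "__main__":
    try:
        with open("/proc/swaps") as fh:
            devices = active_swap_devices(fh.read())
    except OSError:
        devices = []  # not Linux, or /proc not mounted
    print("active swap:", devices or "none")
    for argv in disable_swap_commands():
        print(" ".join(argv))
```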

The Perceus hybrid approach works well, using a small local ramdisk, and 
an NFS mount for the important stuff.  No installation needed, just boot 
it, and up you go.  Same issues on swap.

Another way to look at local drives is as a small swap space.  If you
can get drives at around $40/unit, these are cheap and fairly
reliable swap targets.  Better than USB.  But you shouldn't swap in a
cluster.

The advantage of iSCSI is, it is a block device that you attach, over 
the network, and it works pretty well.  You can (easily) 
programmatically adjust it, clone it, modify it, etc.  Pull it down to a 
different system (offline), make changes, upload the different image, 
and try it on a few nodes.   Basically iSCSI is sort of like diskless in 
a way, though it allows greater ranges of changes without creating whole 
new diskless trees, and without jumping through the host issues with 
some diskless schemes.  It's not a VM, but as with diskless, VMs can boot
from it, which makes development fast and efficient.  It gives you a
VM-able and easily transportable container for your stuff.
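To make "a block device that you attach, over the network" concrete,
here is a small sketch that builds the attach/detach commands, assuming
the open-iscsi initiator (iscsiadm) is what you have on the node.  The
portal address and target IQN in the usage are placeholders, not real
systems, and the function names are made up.  Nothing is executed.

```python
def iscsi_attach_commands(portal, target_iqn):
    """Discover targets on a portal, then log in to one of them."""
    return [
        ["iscsiadm", "-m", "discovery", "-t", "sendtargets", "-p", portal],
        ["iscsiadm", "-m", "node", "-T", target_iqn, "-p", portal, "--login"],
    ]


def iscsi_detach_command(portal, target_iqn):
    """Log out of the target so the image can be changed offline."""
    return ["iscsiadm", "-m", "node", "-T", target_iqn, "-p", portal,
            "--logout"]


if __name__ == "__main__":
    portal = "192.0.2.10:3260"                   # placeholder portal
    iqn = "iqn.2009-05.com.example:node-image"   # placeholder target name
    for argv in iscsi_attach_commands(portal, iqn):
        print(" ".join(argv))
    print(" ".join(iscsi_detach_command(portal, iqn)))
```

After login the target shows up as an ordinary block device on the
node, so the usual tools for copying or modifying a disk apply to it.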

Of course, it goes without saying that for either diskless or iSCSI, you 
need a really good and fast storage infrastructure ... :)

More to the point, the stateful installations have their place, but it 
doesn't appear to be in "cloud" scenarios.  Stateful is fine for local 
clusters, or automagically assembled larger clusters.  I would be 
mindful of the latter though, as we have had a number of customers with 
problems directly attributable to unresolved issues in some cluster 
distribution's installer code.  If your installer has bugs, or does 
something wrong, somewhere (like, I dunno, disabling the IPMI?), you can 
wind up with N unbootable nodes.  As N gets north of 8, this gets less 
funny.

Other issues we have run into with stateful clusters have included the 
cluster distribution getting in the way of what you want to do.  Most 
cluster distributions come with an overarching philosophy, and the "one 
true way" to do things.  Invariably you will find a need to do something 
"an other way", and can run head-first into the philosophy (and 
occasionally ... um ... animated [yeah, that's the ticket] ...
discussions with the disciples of that philosophy) as you try to figure 
out what you need to do.  Some of these philosophies are benign, some 
are aggressive.

The stateful case with specific distributions (RHEL and the like) doesn't
do so well with new hardware.  We have had customers call us up to 
complain that their shiny new nodes seem to crash regularly with the 
(ancient) kernels in RHEL.  Sure enough, the new-thingamabob on the 
motherboard is not well supported in RHEL, so you either have to a) toss 
your cluster and only buy from the HAL, or b) get a newer kernel.  It 
gets more ... exciting ... and not in a good way, when that 
new-thingamabob impedes installation.  We've seen this happen ... too 
many times to count.

The solution is to slip in a new kernel and initrd to fix the problem. 
Slip-streaming a new kernel into a RHEL distribution is an ... um ... 
challenge ... to say the least.

In diskless, this is trivial.  In iSCSI, it is the same as diskless (in 
both cases, the boot kernel and boot initrd are outside of the trees).
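As an illustration of why this is trivial: with network boot, the kernel
and initrd are just two file names in the boot configuration.  The
sketch below renders a PXELINUX-style entry; PXELINUX itself, the file
names, and the helper name are assumptions for the example, not a
description of any particular cluster system.

```python
def pxelinux_entry(label, kernel, initrd, extra_args=""):
    """Render a minimal PXELINUX-style boot entry as text."""
    append = f"initrd={initrd} {extra_args}".strip()
    return "\n".join([
        f"DEFAULT {label}",
        f"LABEL {label}",
        f"  KERNEL {kernel}",
        f"  APPEND {append}",
    ]) + "\n"


if __name__ == "__main__":
    # Hypothetical file names; the node's OS tree is not touched at all.
    old = pxelinux_entry("node", "vmlinuz-old", "initrd-old.img")
    new = pxelinux_entry("node", "vmlinuz-new", "initrd-new.img")
    print(new)
    changed = [b for a, b in zip(old.splitlines(), new.splitlines())
               if a != b]
    print(len(changed), "lines differ")
```

Moving to a newer kernel means changing the KERNEL and APPEND lines and
rebooting the node; the tree it mounts stays as it was.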

For stateful installs, when customers don't care which cluster
distribution is used and let us use our Tiburon system, we PXE boot
from our kernel (2.6.28.7 based) and associated initrd.
Finishing scripts fix up any new kernel bits later on.  It's harder to
do this with some other cluster systems.  Easy with Perceus/Warewulf.

Basically, where I am getting to is, if you really want to go stateful, 
make sure your kernel works with your hardware before you make a 
decision about distro, etc.  Less important with diskless/iSCSI.  If you 
go stateless, solve the swap problem with small cheap local drives.

*  we are drinking the koolaid, and starting work with a partner on this 
stuff.  I do see some merit in it (virtualized or direct), though it 
isn't for everyone in HPC.  The folks for whom it will work well have a 
specific usage profile.
-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615