[Beowulf] stateless compute nodes

Mon Jun 1 09:29:19 PDT 2015

> I was wondering how stateless nodes fare with very memory-intensive
>applications. Does it simply require you to have a large amount of RAM to
>house your file system and program data? or are there other limitations?

it depends.  stateless just means the OS isn't installed on local disks;
it can be made available in two main ways: as a ramdisk image (a single blob
provided at boot time via PXE) or as a network-mounted filesystem.

I like the latter: you boot mostly just the kernel via PXE,
then pivot to a read-only NFS filesystem.  it's fun to know that
script kiddies will struggle a little, and it keeps your nodes
inherently in sync.  there are generally some files that need
to be per-node, i.e. customized per-node and/or writable:
we use OneSIS to handle this.  it makes symlinks during the boot
process: to a tmpfs for writability, or to a per-node file for
customization.
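
to make the symlink idea concrete, here is a minimal sketch in Python of
the kind of mapping described above.  this is not OneSIS's actual code or
config format: the function name plan_links, the /ram and /pernode
locations, and the "writable"/"pernode" labels are all assumptions made
up for illustration.

```python
import posixpath

# Both locations are hypothetical, chosen only for this sketch.
TMPFS_ROOT = "/ram"        # where a tmpfs is assumed to be mounted
PERNODE_ROOT = "/pernode"  # where per-node copies are assumed to live


def plan_links(entries, node):
    """Return {path: symlink_target} for one node.

    entries is a list of (path, kind) pairs, where kind is either
    "writable" (redirect to tmpfs) or "pernode" (redirect to a
    node-specific copy).
    """
    plan = {}
    for path, kind in entries:
        rel = path.lstrip("/")
        if kind == "writable":
            plan[path] = posixpath.join(TMPFS_ROOT, rel)
        elif kind == "pernode":
            plan[path] = posixpath.join(PERNODE_ROOT, node, rel)
        else:
            raise ValueError(f"unknown kind {kind!r} for {path}")
    return plan


if __name__ == "__main__":
    entries = [("/var/log/messages", "writable"), ("/etc/fstab", "pernode")]
    for path, target in plan_links(entries, "node042").items():
        print(f"{path} -> {target}")
```

the sketch only computes the plan; actually creating the links (and
populating the tmpfs) would happen early in boot, before anything tries
to write to those paths.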

there is very little memory overhead to this general approach, and we rarely
worry about NFS overload (one NFS server can handle up to about 1k nodes).
remember that NFS is reasonably good about read-caching, which means
that (eg) the bash binary is not going to hammer the NFS server, since
most of its pages are clean, mapped pages and thus probably cached on the node.

the ugly part is how you overlay onto a RO filesystem - this is not
really a solved problem, even outside the world of stateless HPC clusters.
things like docker, for instance, tend to flail around with methods
like union mounts.  OneSIS uses a cluster-wide config file that contains
an entry for each "non-RO-ness" (mixed in with other parameters).
the best thing about OneSIS is that it's pretty low-tech and obvious.

in the HPC world (well, my area of shared academic HPC), it's pretty 
unclear where things are going.  there are obvious relations between 
node provisioning and systems like modules (which effectively provision
the user/job/process namespace).  if you have all your packages organized
under /opt/something, your modules don't have much work to do (fiddle
with PATH et al).  obviously mechanisms like cgroups, even if you start
using them only for resource control, are components of full-blown
containers, and put a different spin on the job-namespace issue.  it's just
one small step from there to full-on VMs (which *usually* boot their whole
image from a (possibly networked) block device, but certainly could boot a
generic image that NFS-mounts a cluster filesystem, etc.)
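
as a rough illustration of the "fiddle with PATH" point above, here is a
small Python sketch of what loading a package from such a tree amounts to.
it is not how any real modules implementation works; load_package and the
/opt/something/foo/1.0 prefix are hypothetical names, and the bin/ and lib/
layout is an assumption.

```python
import posixpath


def load_package(env, prefix):
    """Return a copy of env with prefix's bin/ and lib/ prepended.

    env is a plain dict of environment variables; the input is not
    modified.  Re-loading the same prefix does not duplicate entries.
    """
    new = dict(env)
    for var, sub in (("PATH", "bin"), ("LD_LIBRARY_PATH", "lib")):
        entry = posixpath.join(prefix, sub)
        # Drop empty fields and any earlier copy of this entry.
        parts = [p for p in new.get(var, "").split(":") if p and p != entry]
        new[var] = ":".join([entry] + parts)
    return new


if __name__ == "__main__":
    env = load_package({"PATH": "/usr/bin:/bin"}, "/opt/something/foo/1.0")
    print(env["PATH"])
    print(env["LD_LIBRARY_PATH"])
```

the point is just that when every package lives under its own prefix, the
per-job "provisioning" reduces to a couple of string edits on the
environment.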

regards, mark hahn.