[Beowulf] Compute Node OS on Local Disk vs. Ram Disk

Eric Thibodeau kyron at neuralbs.com
Tue Sep 30 17:39:29 PDT 2008


    I'm replying to Don's post since he outlines most of the reasons why 
I choose to use the NFS-mounted approach and let you choose whether or 
not you want local disk(s) for scratch. Which brings up the _real_ 
questions:

- how many nodes?
- are they all identical?
- how many users concurrently using the cluster?
- do you have full-time staff assigned to the cluster (as in, hired 
in-house staff who will be there to maintain it)?

    As an example, I'm a student managing a cluster at our department 
and converted it from disk-based RH 7.3 to NFS-booted Gentoo nodes. This 
has given me much flexibility and a very fast path to upgrade the nodes 
(LIVE!) since they would only need to be rebooted if I changed the 
kernel. I can install/upgrade the node's environment by simply chrooting 
into it and using the node's package manager and utilities as if it were 
a regular system. But I am in a special case where, if I break the 
cluster, I can fix it quickly and I always have a backup copy of the 
boot "root" image ready to switch to if my fiddling goes wrong. This 
also implies users aren't on the cluster when I arbitrarily decide to 
change the compiler from GCC-4.1.1 to GCC-4.3.2 ;) Hence the few points 
I mention above and the weighted importance of each of them.
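
    To make the chroot workflow above concrete, here is a minimal sketch in Python that only builds (and prints) the command line; it does not run anything. The image path /srv/nfsroot/node-image and the helper name chroot_command are hypothetical, and the emerge invocation is just one example of a package-manager command you might run inside the image.

```python
import shlex


def chroot_command(image_root, *package_cmd):
    """Build the argv that runs a command inside the NFS-exported node image."""
    if not package_cmd:
        raise ValueError("no command given to run inside the chroot")
    # chroot NEWROOT COMMAND [ARG]... is the standard coreutils form.
    return ["chroot", image_root, *package_cmd]


# Hypothetical image location on the NFS server; adjust to your own layout.
argv = chroot_command("/srv/nfsroot/node-image",
                      "emerge", "--update", "--deep", "world")

# Dry run: show the command instead of executing it.
print(shlex.join(argv))
```

    Printing first and running by hand afterwards fits the "keep a backup image ready" habit: nothing touches the live root until you decide it should.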

    This said, one thing I haven't seen (explicitly) mentioned in all 
the replies is that you don't need a 1-to-1 correspondence between the 
OS image and RAM; this is where you use Unionfs (or aufs) + an 
NFS-mounted root. I am currently writing up a document on how I 
accomplish this (Gentoo Clustering LiveCD); I'll give you a link to the 
beta version of the document if you want. The section describing the SSI 
(Single System Image) gives more 
details of what is discussed here.
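
    As a conceptual model only (this is not the actual Unionfs/aufs setup, and the paths and contents are made up), Python's collections.ChainMap behaves like a two-branch union: lookups fall through to the read-only lower layer, while writes land only in the upper layer. Here the lower dict stands in for the shared NFS root and the upper dict for a small per-node writable branch in RAM.

```python
from collections import ChainMap

# Read-only lower layer: stands in for the NFS-exported root shared by all nodes.
nfs_root = {"/etc/hostname": "generic", "/bin/sh": "<shell binary>"}

# Writable upper layer: stands in for a small per-node branch held in RAM.
node_rw = {}

# Lookups search node_rw first, then nfs_root; writes always go to node_rw.
union = ChainMap(node_rw, nfs_root)

union["/etc/hostname"] = "node01"   # per-node change, stored in the upper layer

print(union["/etc/hostname"])       # node-local copy wins
print(union["/bin/sh"])             # untouched file is served from the shared root
print(nfs_root["/etc/hostname"])    # shared root is unchanged
print(len(node_rw))                 # only the modified file occupies node RAM
```

    The point of the model is the last line: the node pays RAM only for what it changes, not for the whole OS image. A real union filesystem also has to handle deletions (whiteouts), which this sketch ignores.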


Donald Becker wrote:
> On Sun, 28 Sep 2008, Jon Forrest wrote:
>> There are two philosophies on where a compute node's
>> OS and basic utilities should be located:
>> 1) On a local harddrive
>> 2) On a RAM disk
>> I'd like to start a discussion on the positives
>> and negatives of each approach. I'll throw out
>> a few.
>> Both approaches require that a compute node "distribution"
>> be maintained on the frontend machine. In both cases
>> it's important to remember to make any changes to this
>> distribution rather than just using "pdsh" or "tentakel"
>> to dynamically modify a compute node. This is so that the
>> next time the compute node boots, it gets the up-to-date
>> distribution.
> Ahhh, your first flawed assumption.
> You believe that the OS needs to be statically provisioned to the nodes.
> That is incorrect.
> A compute node only needs what it will actually be running
>   - a kernel and device drivers that match the hardware
>   - kernel support for non-hardware-specific features (e.g. ext3 FS)
>   - a file system that presents a standard application environment
>     (The configuration files that the libraries depend upon 
>      e.g. a few files in /etc/*, a /dev/* that matches the hardware,
>      a few misc. directories)
>   - the application executable and libraries it links against
>   - application-specific file I/O environment (usually /tmp/ and a
>     few data directories)
> You can detect the first and most of the second category at node boot 
> time.  The kernel is loaded into memory and kernel modules are 
> immediately linked in, so there isn't any reason to keep them around as a 
> file system.
> The third category does need to be a file system, but it's tiny and 
> changes infrequently.  It can easily be provisioned, or even dynamically 
> created, at node boot.
> The fourth category is an interesting one.  You don't have to statically 
> provision it at boot time, or mount a network file system.  When you issue 
> a process to a node, the system that accepts the process can check that 
> it has the needed executable and libraries.  Better, it can verify that it 
> has the correct versions.  And this is the best time to check, because we 
> can ask the sending machine for a current copy if we don't have the 
> correct version.  By having a model for "execution correctness" we 
> simultaneously eliminate one source of version skew and eliminate the need 
> to pre-load executables and libraries that will be unused or updated 
> before use.  Plus we automatically have a way to handle newly added 
> applications, libraries and utilities without rebooting compute nodes.
>> Assuming the actual OS image is the same in both cases,
>> #2 clearly requires more memory than #1.
> No, it can require substantially less.  It only requires more if you
> assume the naive approach of building a giant RAMdisk with everything you
> might need.  If you think of an alternative model where you are just
> caching the elements needed to do a job, the memory usage is less.
> Think of a compute node as part of a cluster, not a stand-alone machine.  
> The only times that it is asked to do something new (boot, accept a new
> process) it's communicating with a fully installed, up-to-date master 
> node.  It has, at least temporarily, complete access to a reference 
> install.  It can take that opportunity to cache or load elements that 
> it doesn't have, or has an obsolete version of.
> There might be some dynamic elements needed later e.g. name service 
> look-ups, but these should be much smaller than the initial provisioning 
> and the correct/consistency model is inherently looser. 
>> Long ago not installing a local harddrive saved a considerable
>> amount of money but this isn't true anymore. Systems that need
>> to page (or swap) will require a harddrive anyway since paging
>> over the network isn't fast enough so very few compute nodes
>> will be running diskless.
> The hardware cost of a local hard drive wasn't really an issue.  It has 
> always been the least expensive I/O bandwidth available.  The real cost is 
> installing, updating and backing up the drive.  If you design a cluster 
> system that installs on a local disk, it's very difficult to adapt it to 
> diskless blades.  If you design a system that is as efficient without 
> disks, it's trivial to optionally mount disks for caching, temporary files 
> or application I/O.
>> Approach #2 requires much less time when a node is installed,
>> and a little less time when a node is booted.
> We've been able to start diskless compute nodes in
>   <BIOS memory count> + <PXE 2 seconds> + 750 milliseconds  (!)
> To be fair, that was on blades without disk controllers, and just
> Ethernet.  Scanning for local disks, especially with a SCSI layer, can
> take many seconds.  Once you detect a disk it takes a bunch of slow seeks
> to read the partition table and mount a modern file system (not EXT2).  
> So trimming the system initialization time further isn't a priority until 
> after the file system and IB init times are shortened.
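
    Don's "execution correctness" idea (check versions when a process is handed to a node, and pull a fresh copy from the master only when needed) can be sketched as below. This is an illustration under my own assumptions, not a description of how any real cluster system implements it: the NodeCache class, the use of SHA-256 digests as the version check, and the in-memory "master" dict are all hypothetical stand-ins.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content hash used here as the 'is this the correct version?' test."""
    return hashlib.sha256(data).hexdigest()


class NodeCache:
    """Per-node cache of executables and libraries, filled on demand."""

    def __init__(self, fetch):
        self.files = {}      # path -> bytes currently cached on the node
        self.fetch = fetch   # callable(path) -> bytes, asks the master

    def prepare(self, manifest):
        """manifest maps path -> expected digest. Returns the paths fetched."""
        fetched = []
        for path, wanted in manifest.items():
            have = self.files.get(path)
            if have is None or digest(have) != wanted:
                data = self.fetch(path)          # missing or stale: ask master
                if digest(data) != wanted:
                    raise RuntimeError(f"master copy of {path} does not match")
                self.files[path] = data
                fetched.append(path)
        return fetched


# Stand-in for the fully installed master node.
master = {"/opt/app/solver": b"solver v2", "/lib/libm.so": b"libm v1"}
manifest = {path: digest(data) for path, data in master.items()}

node = NodeCache(fetch=master.__getitem__)
node.files["/opt/app/solver"] = b"solver v1"     # stale copy from an older job

first = node.prepare(manifest)    # stale solver and missing library are fetched
second = node.prepare(manifest)   # everything already correct, nothing fetched
print(first)
print(second)
```

    The second call fetching nothing is the memory argument in miniature: the node ends up holding only what jobs have actually needed, at the version they needed.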
