[Beowulf] Compute Node OS on Local Disk vs. Ram Disk
kyron at neuralbs.com
Tue Sep 30 17:39:29 PDT 2008
I'm replying to Don's post since he outlines most of the reasons why
I choose to use the NFS-mounted approach and let you choose whether or
not you want local disk(s) for scratch. Which brings up the _real_
questions:
- how many nodes?
- are they all identical?
- how many users concurrently using the cluster?
- do you have assigned full-time staff responsible for the cluster (as
in, hired in-house staff that will be there to maintain the cluster)?
As an example, I'm a student managing a cluster at our department
and converted it from disk-based RH 7.3 to NFS-booted Gentoo nodes. This
has given me much flexibility and a very fast path to upgrade the nodes
(LIVE!) since they only need to be rebooted if I change the
kernel. I can install/upgrade the nodes' environment by simply chrooting
into it and using the node's package manager and utilities as if it were
a regular system. But I am in a special case where, if I break the
cluster, I can fix it quickly and I always have a backup copy of the
boot "root" image ready to switch to if my fiddling goes wrong. This
also implies users aren't on the cluster when I arbitrarily decide to
change the compiler from GCC-4.1.1 to GCC-4.3.2 ;) Hence the few points
I mention above and the weighted importance of each of them.
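To make the chroot-based upgrade a bit more concrete, here is a minimal Python sketch of the idea. The export path /export/nodes/root and the helper names are made up for illustration, and the emerge arguments are just one plausible choice on a Gentoo image. By default it only prints the command it would run.

```python
import shlex
import subprocess


def chroot_command(image_root, command):
    """Build the argv that runs `command` inside the exported node root."""
    return ["chroot", image_root] + list(command)


def upgrade_node_image(image_root, dry_run=True):
    # Hypothetical upgrade step: update the whole package set of the image.
    argv = chroot_command(image_root, ["emerge", "--update", "--deep", "@world"])
    if not dry_run:
        # Needs root on the NFS server; raises if the command fails.
        subprocess.run(argv, check=True)
    return shlex.join(argv)


if __name__ == "__main__":
    # /export/nodes/root is an assumed location of the NFS-exported root.
    print(upgrade_node_image("/export/nodes/root"))
```

Keeping a dry-run default fits the "backup image ready" habit described above: you can see what would change before touching the image the nodes are running from.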
This said, one thing I haven't seen (explicitly) mentioned in all
the replies is that you don't need a 1-to-1 correspondence between the
OS image and RAM; this is where Unionfs (or aufs) + an NFS-mounted root
comes in. I am currently writing up a document on how I accomplish this
(Gentoo Clustering LiveCD); I can give you a link to the beta version of
the document if you want. The section describing the SSI (Single System
Image) gives more details of what is discussed here.
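As a rough illustration of why the union approach keeps RAM usage small, here is a toy Python model of a union mount. It is not Unionfs or aufs code; the class and method names are invented for this sketch. Reads fall through to the shared read-only root, and only the files a node actually changes land in its RAM layer.

```python
class UnionRoot:
    """Toy model: a writable per-node RAM layer over a read-only NFS root."""

    def __init__(self, nfs_root):
        self.lower = dict(nfs_root)  # shared image, never modified here
        self.upper = {}              # per-node changes, held in RAM
        self.whiteouts = set()       # paths deleted on this node only

    def read(self, path):
        if path in self.whiteouts:
            raise FileNotFoundError(path)
        if path in self.upper:       # the node's own copy wins
            return self.upper[path]
        if path in self.lower:       # otherwise fall through to NFS
            return self.lower[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        self.whiteouts.discard(path)
        self.upper[path] = data      # copy-on-write into RAM

    def remove(self, path):
        self.read(path)              # raises if the path does not exist
        self.upper.pop(path, None)
        if path in self.lower:
            self.whiteouts.add(path) # hide the shared copy on this node

    def ram_bytes(self):
        """RAM consumed by this node's private layer."""
        return sum(len(data) for data in self.upper.values())


if __name__ == "__main__":
    image = {"/etc/hostname": b"node\n", "/bin/sh": b"\x7fELF" + b"\0" * 1000}
    node = UnionRoot(image)
    node.write("/etc/hostname", b"node07\n")
    print(node.read("/etc/hostname"), node.ram_bytes())
```

In this model the node's RAM cost is the size of what it wrote, not the size of the whole root image.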
Donald Becker wrote:
> On Sun, 28 Sep 2008, Jon Forrest wrote:
>> There are two philosophies on where a compute node's
>> OS and basic utilities should be located:
>> 1) On a local harddrive
>> 2) On a RAM disk
>> I'd like to start a discussion on the positives
>> and negatives of each approach. I'll throw out
>> a few.
>> Both approaches require that a compute node "distribution"
>> be maintained on the frontend machine. In both cases
>> it's important to remember to make any changes to this
>> distribution rather than just using "pdsh" or "tentakel"
>> to dynamically modify a compute node. This is so that the
>> next time the compute node boots, it gets the uptodate
> Ahhh, your first flawed assumption.
> You believe that the OS needs to be statically provisioned to the nodes.
> That is incorrect.
> A compute node only needs what it will actually be running
> - a kernel and device drivers that match the hardware
> - kernel support for non-hardware-specific features (e.g. ext3 FS)
> - a file system that presents a standard application environment
> (The configuration files that the libraries depend upon
> e.g. a few files in /etc/*, a /dev/* that matches the hardware,
> a few misc. directories)
> - the application executable and libraries it links against
> - application-specific file I/O environment (usually /tmp/ and a
> few data directories)
> You can detect the first and most of the second category at node boot
> time. The kernel is loaded into memory and kernel modules are
> immediately linked in, so there isn't any reason to keep them around as a
> file system.
> The third category does need to be a file system, but it's tiny and
> changes infrequently. It can easily be provisioned, or even dynamically
> created, at node boot.
> The fourth category is an interesting one. You don't have to statically
> provision it at boot time, or mount a network file system. When you issue
> a process to a node, the system that accepts the process can check that
> it has the needed executable and libraries. Better, it can verify that it
> has the correct versions. And this is the best time to check, because we
> can ask the sending machine for a current copy if we don't have the
> correct version. By having a model for "execution correctness" we
> simultaneously eliminate one source of version skew and eliminate the need
> to pre-load executables and libraries that will be unused or updated
> before use. Plus we automatically have a way to handle newly added
> applications, libraries and utilities without rebooting compute nodes.
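A small Python sketch of the "check versions when the process arrives" idea described above, as I understand it. This is not how Don's system is implemented; the Master and ComputeNode classes, the manifest format, and the use of SHA-256 digests are all assumptions made for illustration.

```python
import hashlib


def digest(data):
    """Content hash used here as the 'version' of a file (an assumption)."""
    return hashlib.sha256(data).hexdigest()


class Master:
    """Stands in for the fully installed, up-to-date master node."""

    def __init__(self, files):
        self.files = dict(files)

    def manifest(self, paths):
        return {path: digest(self.files[path]) for path in paths}

    def fetch(self, path):
        return self.files[path]


class ComputeNode:
    """Caches only what jobs have actually needed so far."""

    def __init__(self):
        self.cache = {}

    def accept(self, manifest, master):
        """Fetch missing or stale files; return the list of paths fetched."""
        fetched = []
        for path, wanted in manifest.items():
            have = self.cache.get(path)
            if have is None or digest(have) != wanted:
                self.cache[path] = master.fetch(path)
                fetched.append(path)
        return fetched


if __name__ == "__main__":
    master = Master({"/opt/app": b"app v1", "/lib/libfoo.so": b"libfoo v1"})
    node = ComputeNode()
    job = ["/opt/app", "/lib/libfoo.so"]
    print(node.accept(master.manifest(job), master))  # first run fetches both
    master.files["/lib/libfoo.so"] = b"libfoo v2"     # library updated on master
    print(node.accept(master.manifest(job), master))  # only the stale file
```

The node never holds anything a job did not ask for, and an updated library is picked up at the next process start without a reboot.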
>> Assuming the actual OS image is the same in both cases,
>> #2 clearly requires more memory than #1.
> No, it can require substantially less. It only requires more if you
> assume the naive approach of building a giant RAMdisk with everything you
> might need. If you think of an alternative model where you are just
> caching the elements needed to do a job, the memory usage is less.
> Think of a compute node as part of a cluster, not a stand-alone machine.
> The only times that it is asked to do something new (boot, accept a new
> process) it's communicating with a fully installed, up-to-date master
> node. It has, at least temporarily, complete access to a reference
> install. It can take that opportunity to cache or load elements that
> it doesn't have, or has an obsolete version of.
> There might be some dynamic elements needed later e.g. name service
> look-ups, but these should be much smaller than the initial provisioning
> and the correctness/consistency model is inherently looser.
>> Long ago not installing a local harddrive saved a considerable
>> amount of money but this isn't true anymore. Systems that need
>> to page (or swap) will require a harddrive anyway since paging
>> over the network isn't fast enough so very few compute nodes
>> will be running diskless.
> The hardware cost of a local hard drive wasn't really an issue. It has
> always been the least expensive I/O bandwidth available. The real cost is
> installing, updating and backing up the drive. If you design a cluster
> system that installs on a local disk, it's very difficult to adapt it to
> diskless blades. If you design a system that is as efficient without
> disks, it's trivial to optionally mount disks for caching, temporary files
> or application I/O.
>> Approach #2 requires much less time when a node is installed,
>> and a little less time when a node is booted.
> We've been able to start diskless compute nodes in
> <BIOS memory count> + <PXE 2 seconds> + 750 milliseconds (!)
> To be fair, that was on blades without disk controllers, and just
> Ethernet. Scanning for local disks, especially with a SCSI layer, can
> take many seconds. Once you detect a disk it takes a bunch of slow seeks
> to read the partition table and mount a modern file system (not EXT2).
> So trimming the system initialization time further isn't a priority until
> after the file system and IB init times are shortened.