[Beowulf] best architecture / tradeoffs

Tim Mattox tmattox at gmail.com
Tue Aug 30 09:41:07 PDT 2005


Hi Mark, other Beowulfers, and the Warewulfers,
Mark, I'll answer your questions inline.  Disclaimer: I am one of
the Warewulf developers/users.  Greg Kurtzer is the originator
and lead developer of Warewulf.

On 8/30/05, Mark Hahn <hahn at physics.mcmaster.ca> wrote:
> > This VNFS resides on the master/boot server, and is used to construct
> > the root filesystem for each node in your cluster.  You can make changes
> > to this template directly using chroot, or indirectly with other scripts/tools
> > for example:  rpm --root /vnfs/default ..., or yum
> > --installroot=/vnfs/default ...
> 
> heh, that's exactly what I do, but never thought to give it a name.
> or rather, I thought "nfsroot" pretty well covered it (and rejoiced
> that rpm/yum have those switches.)

In Warewulf's case, we thought we needed a name that was
more specific than "chroot", and I think it was Geoff Galitz, another
WW developer, who came up with the name VNFS.

> so in what sense is it a virtual node FS?  how is it different from
> the fairly common practice of an NFS-mounted root filesystem?

When you chroot into the VNFS on the master, your command line
acts as if it's on a virtual node.  Any sysadmin config
changes you make, such as with ntsysv, chkconfig, or edits to files
in /etc, will, eventually*, show up on the actual nodes in your cluster.
So, rather than logging into a node to make sysadmin changes, you
chroot into the "virtual node" file system to make them...  I guess it
is very much like what you do now for NFS-root.
*eventually: With an NFS-root, such as that done in oneSIS, changes would
appear immediately (once NFS noticed) on the nodes.  With Warewulf,
your changes don't show up on the nodes until you activate them with
a (short) sequence of commands.  The NFS-root approach has
the advantage of immediate feedback... but has the disadvantage
of taking down your entire cluster with a bad typo...  With Warewulf,
you can test your changes on a single real node before rebooting
all the nodes with the new/changed VNFS.
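
For the sake of illustration, a session on the master might look
something like this (a minimal sketch; the /vnfs/default path and the
service name are just assumptions for the example):

  # chroot into the VNFS template on the master (path assumed)
  chroot /vnfs/default /bin/bash
  # make an ordinary sysadmin change, e.g. disable a service
  chkconfig sendmail off
  exit
  # then activate the updated VNFS and reboot one test node before
  # rolling it out to the rest of the cluster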

There are other differences as well.  The VNFS is a template for
the node's file system.  It's not exactly what the node will see.
The Warewulf scripts will fill in and update a few critical config
files before they are sent to the nodes.  And, a customizable list
of files/directories can be included from the master's FS, and
a customizable list of files/directories can be excluded from
the VNFS.

Also, the node's kernel and its modules are not kept in the VNFS
by default, since in most cases, the nodes would be running the
same kernel as the master, so the kernel and modules are
taken from the master's filesystem. We have been fiddling with
this last bit some, so don't take it as Warewulf "gospel"...
And, due to the chain of events when a Warewulf node boots,
the kernel and modules are given to the node prior to it having
access to the VNFS (at least in Warewulf 2.4 and later).  I could
explain Warewulf's boot sequence in detail in another e-mail if
desired.

Oh, you may wonder which config files get twiddled before
the node sees them, so here is a list I just extracted from the source:
  /etc/hosts
  /root/.rhosts
  /etc/sysconfig/wulfd
  /etc/syslog.conf
  /etc/fstab
And, a few files are updated by the node itself when it boots, or
after it boots.  Specifically, the "wwnodes --sync" command is executed
after the nodes are up, but before they are available for general use. 
(This is how you update the list of allowed users, the list of nodes, etc.)
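
As a concrete example of that sync step, adding a user might go
something like this ("newuser" is of course a made-up account name):

  # create the account on the master as usual
  useradd newuser
  # push the updated user/node lists out to the running nodes
  wwnodes --sync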

The default files included from the master (if they exist) are:
  /etc/motd
  various Myrinet files
  various InfiniBand files
  two files for FNN support

Sure, those files could be put into the VNFS, but they tend to
also be needed on the master, so why update things in two places?
Again, as is common throughout much of Warewulf, if your specific
setup needs to work differently, this list of included files is configurable.
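
To make that concrete, an includes list is just a list of paths; the
/etc/motd entry below is one of the real defaults mentioned above, while
the other two are hypothetical site-specific additions:

  /etc/motd
  /etc/resolv.conf
  /opt/site-license.dat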

> > recent versions, we introduced a hybrid NFS/ramdisk scheme that reduces
> > the permanent RAM footprint dramatically by using a readonly NFS mount
> > of the VNFS for non-critical files. Thus, you can have your full blown text
> 
> presumably just most bits of /var, no?  are there less obvious bits that
> you feel need to be in the ramdisk?

Actually, no, we still keep significantly more than just /var in the tmpfs root.
I guess we are coming at it from the other side of the equation, removing
selected files and directories from the tmpfs root.  So, by default we exclude
things like the RPM database, man pages, and much more if you select
the hybrid mode.  We don't want to waste RAM on the nodes, but we
lean towards scalability first, so that our boot/master can handle more nodes.
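
As a rough illustration (these are my guesses at typical entries, not a
copy of the actual default excludes file), a hybrid-mode excludes list
might contain paths like:

  /var/lib/rpm        # the RPM database
  /usr/share/man      # man pages
  /usr/share/doc      # package documentation

Anything excluded this way is still reachable on the node, just over the
readonly NFS mount instead of from RAM.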

>  incidentally, do you use a ramdisk,
> or do you use initrd's cpio format, or do you simply populate a tmpfs
> during the boot process?  I do the latter - it seemed simplest, once the
> initrd is under way, and has the NFS root mounted, to just mount tmpfs
> here and there and untar (the tar is in the NFS too...)

Yeah, sometime in the Warewulf 2.x versions, we switched to populating
a tmpfs rather than running the system from the initrd.  In the 2.4.x versions,
we have a variety of ways of obtaining the VNFS tarfile which populates
the node's real root fs, including wget, dolly, rsync, and torrent.
With some tweaking of the wwinitrc script in Warewulf 2.4 it is also
possible to have the node's root fs reside on a local disk rather
than in a tmpfs mount.  I have not tried this myself... I don't have disks
in most of my nodes.
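
Roughly, the node-side step looks like this sketch (this is NOT the
actual wwinitrc; the master URL and tarball path are made up, and wget
is just one of the transports listed above):

  # inside the initrd, once the network is up:
  mount -t tmpfs tmpfs /newroot
  # fetch the VNFS tarball from the master and unpack it as the new root
  wget -O - http://master/vnfs/default.tar.gz | tar -xz -C /newroot
  # ...then switch to /newroot and continue booting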

> > editor, compiler, and X installed in the VNFS, yet still have only a 15 to
> > 30 MB ramdisk on the nodes.  Which files reside on the ramdisk vs. which
> 
> hmm, 15M seems fairly elaborate, or do you not pivot/umount away your
> boot code?

Yes, we do pivot and unmount the boot code.  The bulk of the RAM footprint
comes from /lib (dynamic libraries), and the rest is mostly in /usr,
/sbin, and /bin.
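
If you are curious where your own footprint goes, something as simple as
this, run on a booted node, gives a quick breakdown:

  du -sh /lib /usr /sbin /bin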
 
> > created.  Thus, you can get much of the small-RAM-footprint benefit of
> > the NFS-root scheme, yet have dramatically lower NFS traffic to the server
> > during normal cluster use.
> 
> I'd heard people say that was a problem, but haven't found it so.  what files
> are inadequately cached by NFS and wind up causing noticeable traffic?

I've not done a scientific study of this, but I did use "lsof" to find various
files that were actively causing NFS traffic when I used to do NFS-root.
Not large data traffic, but more status/attribute traffic.  The bulk data would
get cached, but NFS would keep checking if a file/directory had changed
status.  At least that was what I remember from several years ago.
Maybe NFS-root has become less chatty since 2001 or so.
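
If you want to repeat that kind of informal measurement, two standard
tools go a long way (lsof's -N flag selects files on NFS mounts, and
nfsstat -c reports client-side call counts, including the getattr
chatter I'm describing):

  # which open files live on NFS mounts?
  lsof -N
  # how many getattr/lookup calls has this client made?
  nfsstat -c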

How many nodes can you safely handle with NFS-root?  I use a single
Warewulf master on 192 nodes (KLAT2 + KASY0), and I know there are
bigger installations.  The design goal is to scale to a thousand+ nodes
from a single boot/master machine.  My NFS-root experience stopped
at 64 nodes, the original configuration of KLAT2, and it worked fine
as far as the load on the boot/master at that time, but it seemed to
be close to the fragility limits of circa-2001 NFS.

> it seems like starting a new job would read little more than the user's
> shell, some shared libraries, /etc/passwd and friends.  I haven't tried to
> collect traces, but they seem quite NFS-caching-friendly...

Yes, the data side of those files is NFS-caching-friendly.  But because
NFS has no true coherency protocol, each node must periodically
check on the attributes of each open NFS file looking for possible
changes (even for readonly files).  And, with all the important shared
libraries already on the ramdisk, job startup in Warewulf can be very
fast.  If you find there is some rarely used, but large, shared library
installed in your VNFS, you can add it to the excludes file, and it would
be obtained over NFS rather than from the ramdisk (when in hybrid mode).
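
A quick way to hunt for such candidates (the VNFS path and the size
threshold are just example values):

  # large shared libraries in the VNFS that might be worth excluding
  find /vnfs/default -name '*.so*' -size +5M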

To recap: in Warewulf, through the magic of symlinks, we have the flexibility,
on a per-file and per-directory basis, to choose whether a file comes from the
ramdisk or from a readonly NFS mount of the VNFS.  So we can trade off the RAM
footprint against scalability and speed as we see fit.  Our defaults in Warewulf
lean toward speed and don't care as much about the RAM footprint.
I periodically try to update our default excludes file to trim out any "fat",
but as RAM prices continue to fall, I worry less and less about finding files
to exclude.
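
For a feel of what that looks like on a node, the hybrid scheme boils
down to something like this (paths hypothetical, with /vnfs standing in
for the node's readonly NFS mount of the VNFS):

  # on the node: man pages resolve to the NFS mount, not the ramdisk
  ln -s /vnfs/usr/share/man /usr/share/man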

> regards, mark hahn.

Thanks Mark for your comments and questions.  If we can
learn from the best practices of NFS-root users/developers, we can
make Warewulf that much better.
-- 
Tim Mattox - tmattox at gmail.com
  http://homepage.mac.com/tmattox/
    I'm a bright... http://www.the-brights.net/



