NFS

Wed Apr 4 05:51:36 PDT 2001

On Sun, 25 Mar 2001, Javier Iglesias wrote:

> Hi,
>
> I'm planning a cluster for the computer science institute I'm
> working for, so I'm collecting as much information on beowulf as
> is available. I'm also a first-time poster to this list.
>
> It looks like using NFS is a great administrator time saver,
> but comes with a high network footprint.
>
> Has anyone tried using 2 NICs per node, configured so that:
> 1/ the first is dedicated to interprocessor communication, while
> 2/ the second is used for NFS, DB calls, ...
>
> Is this idea already known to be a bad/good one?

When you say cluster, what kind of cluster?  If it is a cluster of
workstations (where by workstations I mean boxes with people sitting at
them and working on simple tasks while the CPU cranks through all the
unused time working on parallel distributed tasks) then this isn't a
crazy idea, although I'm not certain that it is worth all the effort.
The architecture won't be particularly good for moderate grained
synchronous tasks anyway, and for coarse grained or embarrassingly
parallel tasks it won't matter (two NICs won't particularly help since
even one NIC likely won't be saturated).
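
The saturation argument above can be sketched as a back-of-envelope
calculation.  This is only an illustration: the function names, the
100 Mbps link speed, the 0.8 threshold and the traffic figures are all
assumptions made up for the example, not measurements.

```python
# Back-of-envelope check of whether a single NIC would be saturated.
# All traffic figures are hypothetical placeholders, not measurements.

def nic_utilization(ipc_mbps, nfs_mbps, link_mbps=100.0):
    """Fraction of one link consumed by IPC and NFS traffic combined."""
    return (ipc_mbps + nfs_mbps) / link_mbps


def second_nic_worthwhile(ipc_mbps, nfs_mbps, link_mbps=100.0, threshold=0.8):
    """Crude rule of thumb: splitting traffic only matters near saturation."""
    return nic_utilization(ipc_mbps, nfs_mbps, link_mbps) >= threshold


if __name__ == "__main__":
    # Coarse-grained job: little IPC, occasional NFS reads.
    print(second_nic_worthwhile(ipc_mbps=2.0, nfs_mbps=5.0))
    # Finer-grained job with heavy messaging plus steady NFS traffic.
    print(second_nic_worthwhile(ipc_mbps=60.0, nfs_mbps=30.0))
```

To use it for real, replace the placeholder figures with traffic
measured on your own nodes.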

If it is a "true beowulf" cluster (where by this I mean a cluster of
dedicated, probably headless nodes running only a single parallel task
distributed by e.g. a "head node" that can be thought of as the console
interface of a true parallel supercomputer) then if your parallel task
is sanely designed it probably STILL won't help, although it might.  The
reason for this is that a typical node at any given time will be running
a kernel, a very few bookkeeping daemons (maybe -- this is a design
decision) and your parallel IPC daemon (e.g. pvmd, lamd) and parallel
task(s).

When the job is started up, there will be a brief burst of NFS traffic
while the task itself is loaded and a brief burst of either local disk
or NFS traffic while shared libraries are loaded.  Thereafter, nearly
everything will run out of memory, either active or cache.  If you
arrange it so that the parallel task itself only reads or writes to disk
on the head node (which contains the actual disk) you can minimize disk
I/O.
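
To see why the startup burst rarely matters, one can compare it with the
length of the run.  The sketch below is a rough estimate under stated
assumptions: the binary and library sizes, the 100 Mbps link, the 0.7
NFS efficiency factor and the one hour run time are hypothetical.

```python
# Rough estimate of how much of a run is spent loading the task over NFS.
# Sizes, link speed and the efficiency factor are assumptions for
# illustration only.

def nfs_startup_seconds(binary_mb, libs_mb, link_mbps=100.0, efficiency=0.7):
    """Seconds needed to pull the binary and shared libraries over NFS."""
    megabits = (binary_mb + libs_mb) * 8.0
    return megabits / (link_mbps * efficiency)


def nfs_fraction_of_run(startup_s, run_s):
    """Fraction of total wall-clock time taken by the startup burst."""
    return startup_s / (startup_s + run_s)


if __name__ == "__main__":
    startup = nfs_startup_seconds(binary_mb=10.0, libs_mb=25.0)
    # Compare the burst against a hypothetical one hour calculation.
    print(startup, nfs_fraction_of_run(startup, run_s=3600.0))
```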

This isn't the only way to arrange things.  Without going into lots of
boring details, if you put local disks on all the nodes, a toolset like
kickstart/dhcpd (along with e.g. "update" from yup or "up2date" from RH)
can be used to maintain low-maintenance, precisely identical Linux
installations on local node disks.  This is in many ways the best way to
proceed, if you don't mind forking out the extra $100 for a single local
disk.  That way nodes have local binaries and libraries, scratch space
for local copies of data or checkpointing or intermediate output, and
local swap (which one should never use as swap per se but which may be
useful for the system to use to help optimize its buffer/cache
subsystems).  NFS is then only needed for e.g. the binary to be run and
any initialization data, and then probably only once at the beginning
and end of a calculation.

Finally, there is the Scyld solution, which does away with NFS
altogether and ALSO simplifies node installation to the limit.

If you do have a task that is IPC bound (moderately grained, possibly
synchronous, relatively large messages) and plan to build a "true
beowulf" architecture you would likely be well-advised to consider using
Scyld and perhaps channel bonding -- bag NFS altogether, put e.g. a CD
in each node to boot the node directly into Scyld from CD, add 2-3
NICs/node connected to 2-3 switches to get fat pipes between nodes.
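
Whether a task is IPC bound can be estimated from the ratio of
communication time to compute time per step.  The sketch below is a
simplified model, not a benchmark: it assumes bonding scales bandwidth
ideally with the number of NICs and leaves latency unchanged, and the
message size, compute time and 100 microsecond latency are hypothetical.

```python
# Simplified model of the share of each step spent on IPC.
# Assumes ideal bandwidth scaling from bonding and fixed latency;
# all numbers are illustrative assumptions.

def comm_time_s(message_bytes, bandwidth_mbps, latency_s=100e-6):
    """Time to send one message: fixed latency plus transfer time."""
    return latency_s + (message_bytes * 8.0) / (bandwidth_mbps * 1e6)


def ipc_fraction(compute_s, message_bytes, bandwidth_mbps, latency_s=100e-6):
    """Fraction of a compute-then-communicate step spent communicating."""
    t_comm = comm_time_s(message_bytes, bandwidth_mbps, latency_s)
    return t_comm / (t_comm + compute_s)


if __name__ == "__main__":
    # 1 MB message after 0.1 s of computation, one NIC versus three bonded.
    for n_nics in (1, 3):
        print(n_nics, ipc_fraction(0.1, 1_000_000, 100.0 * n_nics))
```

If the fraction is already small with one NIC, extra NICs buy little; if
it is dominated by latency rather than transfer time, bonding will not
help much either.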

To summarize, using two NICs to split up IPCs and NFS (etc) >>may<< not
be a bad idea for a NOW/COW architecture (although you'd have to be
careful or you might find that it is a waste of money because it doesn't
improve performance), but is "probably" not what you want to do for a
beowulf architecture.  I have to be wishy-washy in that statement
because there are almost certainly task mixes and architectures that
will put the lie to any conclusive statement.  If you want a more
conclusive statement (because the above still confuses you) give me more
details of your intended architecture and tasks and I'll see what I can
do.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu