<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

Jon<br>

<br>

    I'm replying to Don's post since he outlines most of the reasons

why I choose to use the NFS-mounted approach and let you choose whether

or not you want local disk(s) for scratch. Which brings up the _real_

questions:<br>

<br>

- how many nodes?<br>

- are they all identical?<br>

- how many users use the cluster concurrently?<br>

- do you have full-time staff assigned to the cluster (as in, in-house

hires who will be there to maintain it)?<br>

<br>

    As an example, I'm a student managing a cluster at our department,

and I converted it from disk-based RH 7.3 to NFS-booted Gentoo nodes.

This has given me much flexibility and a very fast path to upgrade the

nodes (LIVE!) since they would only need to be rebooted if I changed

the kernel. I can install/upgrade the node's environment by simply

chrooting into it and using the node's package manager and utilities (as

if it were a regular system). But I am in a special case where, if I

break the cluster, I can fix it quickly and I always have a backup copy

of the boot "root" image ready to switch to if my fiddling goes wrong.

This also implies users aren't on the cluster when I arbitrarily decide

to change the compiler from GCC-4.1.1 to GCC-4.3.2 ;) Hence the few

points I mention above and the weighted importance of each of them.<br>

<br>
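    A minimal sketch of the chroot-based upgrade described above. The
image path /export/nodes/gentoo-root, the function names and the choice
of emerge options are illustrative assumptions, not my actual setup.
By default it only builds the command, so nothing is touched:<br>
<pre wrap="">
```python
import subprocess


def chroot_command(node_root, command):
    """Return the argv that runs `command` inside the node image at node_root."""
    return ["chroot", node_root] + list(command)


def upgrade_node_image(node_root, dry_run=True):
    """Upgrade the packages of an NFS-exported node image from the server.

    The package set ("world") and the emerge options are assumptions for
    illustration; with dry_run=True the command is only returned, not run.
    """
    argv = chroot_command(node_root, ["emerge", "--update", "--deep", "world"])
    if not dry_run:
        # Needs root privileges and a usable image below node_root.
        subprocess.run(argv, check=True)
    return argv


if __name__ == "__main__":
    # Hypothetical export path, shown only to print the resulting command.
    print(" ".join(upgrade_node_image("/export/nodes/gentoo-root")))
```
</pre>
Running the same command against the backup copy of the image first is
a cheap way to rehearse an upgrade before the live nodes see it.<br>
<br>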

    This said, one thing I haven't seen (explicitly) mentioned in all

the replies is that you don't need a 1-to-1 correlation of OS/RAM; this

is where you use Unionfs (or aufs) + an NFS-mounted root. I am

currently writing up a document on how I accomplish this (Gentoo

Clustering LiveCD); I'll give you a link to the beta version of the

document if you want. The section describing the SSI (Single System

Image) gives more details of what is discussed here.<br>

<br>
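    As a rough illustration of the Unionfs/aufs layering, here is a
sketch that builds (but does not run) the mount command stacking a
small writable branch over a read-only NFS root. The paths, the
function name and the idea of using a tmpfs for the writable branch are
assumptions for the example; the exact option syntax may also differ
between Unionfs and aufs versions:<br>
<pre wrap="">
```python
def union_mount_command(rw_branch, ro_branch, mount_point, fs="aufs"):
    """Build the argv that stacks rw_branch (writable) over ro_branch (read-only)."""
    branches = "{}=rw:{}=ro".format(rw_branch, ro_branch)
    if fs == "aufs":
        return ["mount", "-t", "aufs", "-o", "br:" + branches, "none", mount_point]
    if fs == "unionfs":
        return ["mount", "-t", "unionfs", "-o", "dirs=" + branches, "unionfs", mount_point]
    raise ValueError("unsupported union file system: " + fs)


if __name__ == "__main__":
    # Hypothetical layout: /mnt/rw is a small tmpfs, /mnt/nfsroot the shared root.
    print(" ".join(union_mount_command("/mnt/rw", "/mnt/nfsroot", "/mnt/union")))
```
</pre>
Only files a node actually modifies end up in the writable branch, so
the RAM cost is the size of the changes rather than the size of the OS
image.<br>
<br>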

Eric<br>

<br>

Donald Becker wrote:

<blockquote

 cite="mid:Pine.LNX.4.44.0809301106460.1818-100000@bluewest.scyld.com"

 type="cite">

  <pre wrap="">On Sun, 28 Sep 2008, Jon Forrest wrote:

  </pre>

  <blockquote type="cite">

    <pre wrap="">There are two philosophies on where a compute node's

OS and basic utilities should be located:

1) On a local harddrive

2) On a RAM disk

I'd like to start a discussion on the positives

and negatives of each approach. I'll throw out

a few.

Both approaches require that a compute node "distribution"

be maintained on the frontend machine. In both cases

it's important to remember to make any changes to this

distribution rather than just using "pdsh" or "tentakel"

to dynamically modify a compute node. This is so that the

next time the compute node boots, it gets the up-to-date

distribution.

    </pre>

  </blockquote>

  <pre wrap=""><!---->

Ahhh, your first flawed assumption.

You believe that the OS needs to be statically provisioned to the nodes.

That is incorrect.

A compute node only needs what it will actually be running

  - a kernel and device drivers that match the hardware

  - kernel support for non-hardware-specific features (e.g. ext3 FS)

  - a file system that presents a standard application environment

    (The configuration files that the libraries depend upon 

     e.g. a few files in /etc/*, a /dev/* that matches the hardware,

     a few misc. directories)

  - the application executable and libraries it links against

  - application-specific file I/O environment (usually /tmp/ and a

    few data directories)

You can detect the first and most of the second category at node boot 

time.  The kernel is loaded into memory and kernel modules are 

immediately linked in, so there isn't any reason to keep them around as a 

file system.

The third category does need to be a file system, but it's tiny and 

changes infrequently.  It can easily be provisioned, or even dynamically 

created, at node boot.

The fourth category is an interesting one.  You don't have to statically 

provision it at boot time, or mount a network file system.  When you issue 

a process to a node, the system that accepts the process can check that 

it has the needed executable and libraries.  Better, it can verify that it 

has the correct versions.  And this is the best time to check, because we 

can ask the sending machine for a current copy if we don't have the 

correct version.  By having a model for "execution correctness" we 

simultaneously eliminate one source of version skew and eliminate the need 

to pre-load executables and libraries that will be unused or updated 

before use.  Plus we automatically have a way to handle newly added 

applications, libraries and utilities without rebooting compute nodes.

  </pre>

  <blockquote type="cite">

    <pre wrap="">Assuming the actual OS image is the same in both cases,

#2 clearly requires more memory than #1.

    </pre>

  </blockquote>

  <pre wrap=""><!---->

No, it can require substantially less.  It only requires more if you

assume the naive approach of building a giant RAMdisk with everything you

might need.  If you think of an alternative model where you are just

caching the elements needed to do a job, the memory usage is less.

Think of a compute node as part of a cluster, not a stand-alone machine.  

The only times that it is asked to do something new (boot, accept a new

process) it's communicating with a fully installed, up-to-date master 

node.  It has, at least temporarily, complete access to a reference 

install.  It can take that opportunity to cache or load elements that it 

doesn't have, or has an obsolete version of.

There might be some dynamic elements needed later e.g. name service 

look-ups, but these should be much smaller than the initial provisioning 

and the correctness/consistency model is inherently looser. 

  </pre>

  <blockquote type="cite">

    <pre wrap="">Long ago not installing a local harddrive saved a considerable

amount of money but this isn't true anymore. Systems that need

to page (or swap) will require a harddrive anyway since paging

over the network isn't fast enough so very few compute nodes

will be running diskless.

    </pre>

  </blockquote>

  <pre wrap=""><!---->

The hardware cost of a local hard drive wasn't really an issue.  It has 

always been the least expensive I/O bandwidth available.  The real cost is 

installing, updating and backing up the drive.  If you design a cluster 

system that installs on a local disk, it's very difficult to adapt it to 

diskless blades.  If you design a system that is as efficient without 

disks, it's trivial to optionally mount disks for caching, temporary files 

or application I/O.

  </pre>

  <blockquote type="cite">

    <pre wrap="">Approach #2 requires much less time when a node is installed,

and a little less time when a node is booted.

    </pre>

  </blockquote>

  <pre wrap=""><!---->

We've been able to start diskless compute nodes in

  &lt;BIOS memory count&gt; + &lt;PXE 2 seconds&gt; + 750 milliseconds  (!)

To be fair, that was on blades without disk controllers, and just

Ethernet.  Scanning for local disks, especially with a SCSI layer, can

take many seconds.  Once you detect a disk it takes a bunch of slow seeks

to read the partition table and mount a modern file system (not EXT2).  

So trimming the system initialization time further isn't a priority until 

after the file system and IB init times are shortened.

  </pre>

</blockquote>

<br>
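    To make Don's fourth category a bit more concrete, here is a toy
sketch of a node checking, at the moment a process arrives, that it
holds the right versions of the executable and libraries, and asking
the sender for a current copy when it does not. This is not how Scyld
implements it; the class, the SHA-256 digests and the manifest format
are all assumptions made for illustration:<br>
<pre wrap="">
```python
import hashlib


def digest(data):
    """Content hash used to decide whether two copies are the same version."""
    return hashlib.sha256(data).hexdigest()


class NodeCache:
    """Files a compute node currently holds, keyed by path."""

    def __init__(self):
        self.files = {}
        self.fetches = 0  # how many times the sender had to be asked

    def ensure(self, manifest, fetch):
        """Make the cache match manifest (path to expected digest).

        fetch(path) must return the sender's current bytes for that path.
        Files that are missing or whose digest differs are (re)fetched.
        """
        for path, expected in manifest.items():
            cached = self.files.get(path)
            if cached is not None and digest(cached) == expected:
                continue  # correct version already present
            data = fetch(path)
            if digest(data) != expected:
                raise RuntimeError("sender copy of " + path + " does not match manifest")
            self.files[path] = data
            self.fetches += 1


if __name__ == "__main__":
    # Hypothetical master-side files standing in for a reference install.
    master = {"/bin/app": b"app v1", "/lib/libfoo.so": b"libfoo v1"}
    node = NodeCache()

    def manifest():
        return {path: digest(data) for path, data in master.items()}

    node.ensure(manifest(), master.__getitem__)   # first job: everything is fetched
    node.ensure(manifest(), master.__getitem__)   # same job again: nothing is fetched
    master["/lib/libfoo.so"] = b"libfoo v2"       # library updated on the master
    node.ensure(manifest(), master.__getitem__)   # only the stale library is fetched
    print(node.fetches)
```
</pre>
The point of checking at process-start time is that the sender is
reachable right then, so a stale or missing file can be replaced
without pre-loading anything and without rebooting the node.<br>
<br>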

</body>

</html>