[Beowulf] Re: motherboards for diskless nodes

Fri Feb 25 10:29:35 PST 2005

On Thu, Feb 24, 2005 at 10:03:59PM -0500, Donald Becker wrote:
> 
> The specific problem here is very likely the PXE server implementation,
> not the client side.
> 

   Ah, thanks.  I'd read a bit about some of the alternate paths PXE used,
but didn't realize there were quite so many bugs in Intel's various
implementations.  I'm currently using Tim Hurman's PXE 1.42, ISC DHCPD 3.0.1
on the same segment (and if I think about it, I bet I'll find a race in
there somewhere), and Jean-Pierre Lefebvre's atftpd 3.0.1.  It works
well on many of our systems, but poorly on the largest class (the
compute nodes). The next time I have to do a major install I may try to get
it working better on whatever class of machine is being installed that day.

> 
> >Spending a couple gigs of that for a locally installed O/S isn't much of a
> >drama, especially on ~16 nodes.
> 
> But it's the long-term administrative effort that costs, not the disk
> hardware.  The need to maintain and update a persistent local O/S is the
> root of most of that cost.
> 

   For small clusters, though, local admin just isn't that much of a burden
compared to the hassles of writing your own distributed filesystem, testing
images, scheduling reboots, or crashing long-term jobs.  I'm using Makefile
targets for each request, so I can later make the same changes to new nodes.
e.g.:

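# Install GSL on every host listed in the file "nodes".
# (Recipe lines must start with a tab in a real Makefile.)
.PHONY: gsl
gsl:
        for h in `cat nodes` ; \
                do \
                ssh $$h apt-get install -y libgsl0 libgsl0-dev; \
                done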

   There are more elaborate cluster management/install systems (FAI,
cfengine, ...), dsh to perform ssh in parallel, etc., but for a small
research cluster with installation requirements that change daily, being
able to make simple changes in flight, without testing, scheduling
updates, getting administrative approval, or doing any of the other hard
stuff, turns day-long tasks into a few minutes.

User: can I have...
Reply 2 minutes later: installed.

   It's not quite as simple as a single system image, but it's only about
twice as much work as doing one node, retains all the flexibility, and
doesn't require rebooting or re-imaging nodes and killing jobs.
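
   For anyone who wants the parallel behaviour of dsh without installing
it, here is a minimal sketch in plain sh.  It assumes the same one-host-
per-line "nodes" file as the Makefile target above.  The names
run_on_nodes and RUN_REMOTE are made up for illustration; RUN_REMOTE only
exists so the loop can be tried without real ssh access.

```shell
#!/bin/sh
# Sketch: run one command on every host in a node list, in parallel.
# run_on_nodes and RUN_REMOTE are hypothetical names, not part of any tool.
# RUN_REMOTE defaults to "ssh -n" (stdin from /dev/null, needed for
# ssh processes started in the background).

run_on_nodes() {
    nodefile=$1
    shift
    failed=0
    pids=""
    # Start one background job per host and remember its PID.
    for h in $(cat "$nodefile"); do
        ${RUN_REMOTE:-ssh -n} "$h" "$@" &
        pids="$pids $!"
    done
    # Collect each job's exit status and count the failures.
    for p in $pids; do
        wait "$p" || failed=$((failed + 1))
    done
    echo "failed: $failed"
    [ "$failed" -eq 0 ]
}
```

   Usage would look like "run_on_nodes nodes apt-get install -y libgsl0
libgsl0-dev".  Output from the hosts is interleaved, so this suits quiet
commands better than chatty ones.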

> >deleted when no longer in use.  NFS (being stateless) doesn't have
> >this behavior, so after an update you may occasionally have
> >jobs/daemons when they try to page in a file that has already been
               ^die  [oops]
> >replaced.

> 
> NFS isn't bad.  Nor does it necessarily doom a server to unbearable
> loads.  For some types of file access, especially read-only access to
> small (<8KB) configuration files such as ~/.foo.conf, it's pretty close
> to optimal.

   Oh, NFS is actually pretty good (most of gigE wire speed on large files),
and I really like being able to do maintenance on the fileservers without
killing jobs (shutdown and replace dead disks, switch kernels, etc).  It's
our fileservers that can't keep up: the cluster is able to pound them into
the ground over NFS.  If client-side NFS were worse, our fileservers would
remain responsive during major job launches.  :) Having users with on the
order of a million small files each, most of which they try to open during
the course of their jobs, is pretty damaging.  All of this is research code,
so it tends to get written once (data not consolidated in a database), run
once on the cluster, and then published.  Localizing the damage helps,
RAID10 helps, and convincing people to stage off a sacrificial scratch
fileserver also helps.  There are some relatively new distributed
filesystems out there (Lustre, GFS, ...) that might survive this load
better, but we haven't tested them, some aren't really Unix filesystems at
all, and we are a long way from ready to commit /home to one.