[Beowulf] Re: motherboards for diskless nodes

Donald Becker becker at scyld.com
Thu Feb 24 19:03:59 PST 2005


On Thu, 24 Feb 2005, Drake Diedrich wrote:
> On Thu, Feb 24, 2005 at 06:20:21PM -0500, Jamie Rollins wrote:
>>
>> How netboot-capable are modern motherboards with on-board nics?  I
>> have experience with a couple that support PXE.

PXE is the only standard way to netboot a PC.  It has swept away the
few other proprietary approaches (e.g. RPL).

I'm not a fan of the PXE protocol and specification details.  It's ugly.
They picked bad semantics.  They picked bad protocols.  They picked
exceptionally bad parameters.

I'm a huge fan of PXE as a standard.
It didn't need to be right.  It just needed to be good enough that we can
make it work reliably.  PXE, and all of its ugliness, is gone in two
seconds.

And with 50+ million installations, it's everywhere.
That Is Good.

>> However, I have
>> been having a hard time finding information on-line stating explicitly that
>> a given motherboard and/or bios supports netbooting.

Virtually every current motherboard with Ethernet supports PXE booting.

A few years ago, when Gigabit Ethernet was new, some motherboards had
both Fast and Gb Ethernet just because there was no PXE and WOL support
for the GbE chips.

>   You probably want to buy one in advance to test how reliable it is when
> PXE booting.  We have a 64-node cluster with local disks that have no CDROMs
> or floppies, and we do maintenance and installs by net booting.  It isn't
> reliable.  We have to reboot several times to get the things to hear the
> PXE/DHCP replies and boot the pxelinux.0 image when attempting to reinstall
> a node.

The specific problem here is very likely the PXE server implementation,
not the client side.

I'm guessing that you are using the ISC DHCP server combined with a
stand-alone TFTP server.  That combination can't provide true PXE
service: it cannot work around more than a single version of PXE bugs,
and it has significant "scalability challenges" when many machines boot
simultaneously.  Almost every BIOS uses the Intel PXE client code
unchanged, and it accepts DHCP responses with static PXE information.
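
For concreteness, here is a minimal Python sketch (mine, not code from
any real server) of roughly all the client inspection a static setup
does: check DHCP option 60 for "PXEClient" and, if it matches, hand
every client the same canned boot filename.  Nothing in it can tell one
PXE generation's bugs from another's.

    def dhcp_options(packet: bytes) -> dict:
        """Return {option_code: value} from a raw BOOTP/DHCP packet."""
        # Options follow the 236-byte BOOTP header plus the 4-byte cookie.
        opts, data, i = {}, packet[240:], 0
        while i < len(data) and data[i] != 255:   # 255 = end of options
            if data[i] == 0:                      # 0 = pad byte
                i += 1
                continue
            code, length = data[i], data[i + 1]
            opts[code] = data[i + 2:i + 2 + length]
            i += 2 + length
        return opts

    def is_pxe_request(packet: bytes) -> bool:
        # PXE firmware sets option 60 (vendor class) to "PXEClient:...".
        # A static server then attaches the same boot file (options 66/67)
        # for every client, regardless of which PXE generation is asking.
        return dhcp_options(packet).get(60, b"").startswith(b"PXEClient")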

We ended up writing our own integrated PXE server to reliably boot
compute nodes.  A purpose-built PXE server can
   - interpret the initial request to work around different generations of
     PXE client bugs.  The BIOS code is unlikely to be fixed, and there
     are some pretty ugly bugs.  (What does a file name of '' mean?  Use
     the last file requested...)
   - work around the TFTP capture effect, where clients that drop a
     packet are squeezed out and quickly give up, leaving the machine
     powered on but useless.
   - defer answering new requests when especially busy, but always
     respond before the client times out.
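
A rough Python sketch of those three behaviors (the names, defaults and
timeouts below are invented for illustration; the real Scyld server is
not shown here):

    import collections, time

    CLIENT_TIMEOUT = 2.0        # answer well before the firmware gives up
    last_file = {}              # MAC -> last filename that client requested
    Request = collections.namedtuple("Request", "mac filename deadline")
    queue = collections.deque()

    def normalize(req):
        # Bug workaround: an empty filename means "the file I asked for
        # last time", so remember the last request per client.
        name = req.filename or last_file.get(req.mac, "pxelinux.0")
        last_file[req.mac] = name
        return req._replace(filename=name)

    def accept(mac, filename):
        # Defer new requests when busy, but stamp each with a deadline so
        # it is still answered before the client's own timeout expires.
        queue.append(Request(mac, filename, time.time() + CLIENT_TIMEOUT / 2))

    def service(send_reply, busy=lambda: False):
        # FIFO order is a stand-in here for the fairness needed to keep a
        # client that dropped a packet from being starved out (the capture
        # effect); the deadline check makes "defer when busy" safe.
        while queue:
            req = queue[0]
            if busy() and time.time() < req.deadline:
                return                    # safe to defer a little longer
            send_reply(normalize(queue.popleft()))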

Just as importantly, our PXE server uses and updates the single cluster
configuration file.  Before writing the server we went through several
rounds of writing configuration files from other configuration files,
and each time we ended up with a fragile implementation that was
difficult to debug.


> older P-III Tyan boards, new Xeon Supermicro/e1000 boards, etc).  Not all
> motherboard PXE/DHCP boot implementations are equal and up to the task for
> completely diskless use.  If you switch to a slightly newer motherboard on
> deployment, all bets are off again (yes, made that mistake once, but had a
> friendly supplier who let us exchange parts until it all worked).

Yup, they are using the Intel code, but with different bugs.  Read notes
on the web about using the ISC DHCP server: "You can work around 
bug #1 with canned response #1, which is incompatible with the response
to fix bug #2."

>   If you deal with temporary files, want to suspend and swap out large
> low-priority jobs, etc, you probably want a local disk on each node anyway.

I completely agree.  Local disk is the best I/O bandwidth for the buck.

> Spending a couple gigs of that for a locally installed O/S isn't much of a
> drama, especially on ~16 nodes.

But it's the long-term administrative effort that costs, not the disk
hardware.  The need to maintain and update a persistent local O/S is the
root of most of that cost.

> It makes updates more reliable, as libraries/binaries that are in use
> remain on local disk even when dpkg replaces them, and only get
> deleted when no longer in use.  NFS (being stateless) doesn't have
> this behavior, so after an update you may occasionally have
> jobs/daemons fail when they try to page in a file that has already
> been replaced.

A cluster has a richer versioning environment than a single local
machine.  Simultaneously using different versions of a long-running
application is something you have to consider when running a cluster
system.  And as you point out, NFS sometimes doesn't do the Right
Thing.

But a persistent local install isn't the only way to accomplish this.
We put a specialized whole-file-caching filesystem underneath our
system.  It's used only for libraries and executables, and it tracks
them by version, not just path name.  Since it retains the whole file,
we don't encounter the unrecoverable problem of a page-in failure.  And
since the node only fully accepts a process after caching the required
files, we avoid other failure points.
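
A toy sketch of the idea (the names and paths are invented; this is not
the actual filesystem): cache each needed file whole, keyed by path and
version, before the process is allowed to start.

    import hashlib, pathlib, shutil

    CACHE = pathlib.Path("/var/cache/nodefs")   # hypothetical cache location

    def cache_file(src: pathlib.Path, version: str) -> pathlib.Path:
        """Copy the whole file into the cache under a (path, version) key."""
        CACHE.mkdir(parents=True, exist_ok=True)
        key = hashlib.sha1(f"{src}:{version}".encode()).hexdigest()
        dst = CACHE / key
        if not dst.exists():
            shutil.copy2(src, dst)       # whole file: no later page-in needed
        return dst

    def prepare_process(files):
        """files: iterable of (path, version) pairs the process will need."""
        # The node accepts the process only after every file is local, so a
        # master copy replaced mid-run can't cause a page-in failure.
        return [cache_file(pathlib.Path(p), v) for p, v in files]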

> If you don't have a central fileserver yet, you can also spread your
> users' home directories among the disks on the nodes to avoid NFS contention
> (though this means no RAID unless you buy two disks per node).

NFS isn't bad.  Nor does it necessarily doom a server to unbearable
loads.  For some types of file access, especially read-only access to
small (<8KB) configuration files such as ~/.foo.conf, it's pretty close
to optimal.

What you don't want to use it for is
   - paging in programs and libraries.
       Especially not in a big cluster with big applications.
   - writing files that are used for synchronization
       NFS uses semi-synchronous writes, which kill performance, combined
       with unpredictable time-based cache flushing, which kills consistency.
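
If a job really must synchronize through a shared NFS directory, the
classic workaround (described in the Linux open(2) man page, nothing
cluster-specific) is the link(2) lock-file trick rather than trusting
O_EXCL or data writes.  A hedged Python sketch:

    import os, socket

    def nfs_lock(lockfile: str) -> bool:
        """Try to take a lock file over NFS; return True if we got it."""
        unique = f"{lockfile}.{socket.gethostname()}.{os.getpid()}"
        open(unique, "w").close()        # unique file on the same filesystem
        try:
            os.link(unique, lockfile)    # link() is atomic on the server
            return True
        except OSError:
            # A lost reply can make a successful link() look like a failure;
            # the link count on the unique file is the reliable answer.
            return os.stat(unique).st_nlink == 2
        finally:
            os.unlink(unique)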

>> Something else that we're looking for that I believe is far more esoteric
>> and has been equally hard to find information about is BIOS serial console
>> redirect, ie. being able to control the bios from the serial port.

Serial consoles are pretty common today, albeit not well documented.
But using them is a hardware problem.  You double your connection count,
with non-standard cables, large connectors and expensive serial port
concentrators.  Compare that to Ethernet, with standard cables, tiny but
robust connectors, link beat lights, cheap switches and simple
configuration rules.

A much better solution is using a software system that has reliable
booting.  That means
   - Unchanging boot firmware on the node
   - Minimal hardware configuration before contacting boot server
   - Configuration reporting as part of the boot negotiation
   - No boot dependence on node configuration e.g. file system contents
   - Complete replacement of the boot software by the boot server
   - Immediate status logging over the network
   - Non-boot drivers and configuration controlled by the boot server
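
As a toy illustration of the reporting and logging points (an invented
example, not the Scyld protocol): the node fires status packets at the
boot server from the moment the NIC is up, so a failed boot can be
diagnosed without any console at all.

    import json, socket, time

    BOOT_SERVER = ("10.0.0.1", 5151)     # hypothetical address and port

    def report(stage: str, **details):
        """Fire-and-forget one UDP status packet at the boot server."""
        msg = json.dumps({"stage": stage, "time": time.time(), **details})
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.sendto(msg.encode(), BOOT_SERVER)

    # e.g. during early boot:
    # report("nic-up", driver="e1000", mac="aa:bb:cc:dd:ee:ff")
    # report("kernel-loaded", bytes=4194304)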

These are simple principles, but almost every system out there misses at
least one.

> down.  For console/BIOS, I prefer to just use a long monitor/keyboard cable
> and plug it directly into the node with the problems.

The "crash cart" approach.  I believe in it.  You should only need to
use the console when you have a hardware problem, and you'll need to be 
right by the machine anyway.

Footnote b1tch: Yup, PXE is ugly and stupid.  Most of its stupidities
are hidden, but one observable one is how it tries to locate a server.
The client
tries for 1+2+4+16+32 seconds to locate a PXE server.  A switch with
spanning tree protocol enabled doesn't pass traffic for 60 seconds to
avoid network loops.  It should try for a slightly longer or much
shorter period.  And the exponential fallback is pointless.  Apparently
someone thought that it would be clever to use an Ethernet-like
fallback, imagining it would avoid congestion.  But it would take
thousands of machines to saturate even 10Mbps Ethernet.  It's common for
the first packet or two to be dropped as the network link stabilizes,
leading to a two or four second delay.

Donald Becker				becker at scyld.com
Scyld Software	 			Scyld Beowulf cluster systems
914 Bay Ridge Road, Suite 220		www.scyld.com
Annapolis MD 21403			410-990-9993


