[Beowulf] SATA II - PXE+NFS - diskless compute nodes

Wed Dec 13 15:00:53 PST 2006

On Wed, 13 Dec 2006, Simon Kelley wrote:
> Donald Becker wrote:
> > On Tue, 12 Dec 2006, Simon Kelley wrote:
> >> Joe Landman wrote:
> >>>>> I would hazard that any DHCP/PXE type install server would struggle
> >>>>> with 2000 requests (yes- you arrange the power switching and/or
> >>>>> reboots to stagger at N second intervals).
> >
> > The limit with the "traditional" approach, the ISC DHCP server with one of 
> > the three common TFTP servers, is about 40 machines before you risk losing 
> > machines during a boot.  With 100 machines you are likely to lose 2-5 
> > during a typical power-restore cycle when all machines boot 
> > simultaneously.
...
> > The right solution is to build a smart, integrated PXE server that 
> > understands the bugs and characteristics of PXE.  I wrote one a few years 
> 
> Is that server open-source/free software, or part of Scyld's product? No
> judgement implied, I'm just interested to know if I can download and
> learn from it.

When I wrote the first implementation I expected that we would be 
publishing it under the GPL or a similar open source license, as we had 
with most of our previous software.
But the problems we had with Los Alamos removing the Scyld name and
copyright from our code (the Scyld PXE server uses our "beoconfig" 
config file interface, which is common to both BProc and BeoBoot) caused 
us to not publish the code initially.  And as often happens, early 
decisions stick around far longer than you expect.

At some point we may revisit that decision, but it's not currently
a priority.  I have been very willing to talk with people about the 
implementation, although only people such as Peter Anvin (pxelinux) and 
Marty Connor (Etherboot) don't quickly find a reason to "freshen their 
drink" when I start ;->.

> >>> fwiw:  we use dnsmasq to serve dhcp and handle pxe booting.  It does a
> >>> marvelous job of both, and is far easier to configure (e.g. it is less
> >>> fussy) than dhcpd.

The configuration files issue was one of the triggering reasons for 
investigating writing our own server.

Until 2002 we were focused on BeoBoot as the solution for booting nodes, 
and PXE was a side thought to support a handful of special machines, such 
as the RLX blades.

As PXE became common we went down the path of using our config file 
to generate ISC DHCP config files.  This broke one of my rules: 
avoid using config files to write other config files.  You can't trace 
updates to their effects, and can't trace problems to their source.  This 
was a test that proved the rule: we had three independent 
ways to write the config files to have backups if/when we encountered a 
bug.  But that meant three programs were broken each time the ISC DHCP 
config file changed incompatibly.

> >> Joe, you might like to know that the next release of dnsmasq includes a
> >> TFTP server so that it can do the whole job. The process model for the
> >> TFTP implementation should be well suited to booting many nodes at once
> >> because it multiplexes all the connections on the same process. My guess
> >> is that will work better than having inetd fork 2000 copies of tftpd,
> >> which is what would happen with traditional TFTP servers.
> > 
> > Yup, that's a good start.  It's one of the many things you have to do.  

I should repeat this: forking a dozen processes sounds like a good idea.
Thinking about forking a thousand (we plan every element to scale to "at 
least 1000") makes "1" seem like a much better idea.

With one continuously running server, the coding task is harder.  You 
can't leak memory.  You can't leak file descriptors.  You have to check for 
updated/modified files.  You can't block on anything.  You have to re-read
your config file and re-open your control sockets on SIGHUP rather than 
just exiting.  You should show/checkpoint the current state on SIGUSR1.
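
As an illustration only (this is not the Scyld server), the checklist
above might be sketched in Python roughly as follows.  The class name
MiniServer, the "key value" config format and the echo reply are all
assumptions made for the example; SIGHUP and SIGUSR1 assume a POSIX
system.

```python
import os
import select
import signal
import socket


class MiniServer:
    """Hypothetical single-process UDP server skeleton."""

    def __init__(self, config_path, port=0):
        self.config_path = config_path
        self.port = port              # 0 lets the OS pick a free port
        self.config = {}
        self.config_mtime = None
        self.packets_seen = 0
        self.reload_wanted = True     # load the config on the first pass
        self.dump_wanted = False
        self.sock = None

    def on_signal(self, signum, frame):
        # Handlers only set flags; the main loop does the real work.
        if signum == signal.SIGHUP:
            self.reload_wanted = True
        elif signum == signal.SIGUSR1:
            self.dump_wanted = True

    def open_socket(self):
        if self.sock is not None:
            self.sock.close()         # do not leak the old descriptor
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.sock.setblocking(False)  # never block on a read
        self.sock.bind(("127.0.0.1", self.port))

    def load_config(self):
        cfg = {}
        with open(self.config_path) as f:
            for line in f:
                line = line.split("#", 1)[0].strip()
                if line:
                    key, _, value = line.partition(" ")
                    cfg[key] = value.strip()
        self.config = cfg
        self.config_mtime = os.stat(self.config_path).st_mtime_ns

    def config_changed(self):
        # Notice edits even if nobody sends SIGHUP.
        return os.stat(self.config_path).st_mtime_ns != self.config_mtime

    def state(self):
        return {"packets_seen": self.packets_seen,
                "config": dict(self.config)}

    def step(self, timeout=0.1):
        """Run one iteration of the main loop."""
        if self.reload_wanted or self.config_changed():
            self.load_config()
            self.open_socket()
            self.reload_wanted = False
        if self.dump_wanted:
            print(self.state())
            self.dump_wanted = False
        readable, _, _ = select.select([self.sock], [], [], timeout)
        for s in readable:
            try:
                data, peer = s.recvfrom(2048)
            except BlockingIOError:
                continue
            self.packets_seen += 1
            s.sendto(data, peer)      # placeholder reply: echo the request
```

A caller would register the handlers with
signal.signal(signal.SIGHUP, srv.on_signal) and
signal.signal(signal.SIGUSR1, srv.on_signal), then call srv.step() in a
loop.  Keeping the handlers down to flag-setting is a design choice: all
state changes happen in one place, between packets.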

Once you do have all of that written, it's now possible, even easy, to 
count how many bytes and packets were sent in the last timer tick and to 
check that every client asked for and received a packet in the last half
second.  Combine the two and you can smoothly switch from bandwidth 
control to round-robin responses, then to slightly deferring DHCP 
responses.
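
A rough sketch of that bookkeeping, under stated assumptions: the class
LoadTracker, the byte budget, and the rule for choosing between the three
modes (in particular the defer_fraction threshold) are invented for
illustration.  Only the half-second window and the order of the modes
come from the description above.

```python
class LoadTracker:
    """Hypothetical per-tick counters plus a per-client starvation check."""

    def __init__(self, byte_budget, starve_after=0.5, defer_fraction=0.25):
        self.byte_budget = byte_budget        # assumed bytes allowed per tick
        self.starve_after = starve_after      # "the last half second"
        self.defer_fraction = defer_fraction  # assumed escalation threshold
        self.bytes_sent = 0
        self.packets_sent = 0
        self.pending = {}   # client -> time of its oldest unanswered request

    def on_request(self, client, now):
        # Keep the oldest timestamp so retries do not hide starvation.
        self.pending.setdefault(client, now)

    def on_reply(self, client, nbytes):
        self.bytes_sent += nbytes
        self.packets_sent += 1
        self.pending.pop(client, None)

    def may_send(self, nbytes):
        # Bandwidth control: stay within this tick's byte budget.
        return self.bytes_sent + nbytes <= self.byte_budget

    def end_tick(self):
        self.bytes_sent = 0
        self.packets_sent = 0

    def starved(self, now):
        return sorted(c for c, t in self.pending.items()
                      if now - t > self.starve_after)

    def mode(self, now):
        starved = self.starved(now)
        if not starved:
            return "bandwidth"      # everyone is being served in time
        if len(starved) <= self.defer_fraction * len(self.pending):
            return "round_robin"    # a few stragglers: serve clients in turn
        return "defer_dhcp"         # many stragglers: slow down new arrivals
```

The main loop would call on_request() and on_reply() as packets move,
end_tick() from its timer, and consult mode() before deciding which
client to answer next.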

> It's maybe worth giving a bit of background here: dnsmasq is a
> lightweight DNS forwarder and DHCP server. Think of it as being
> equivalent to BIND and ISC DHCP with BIND mainly in forward-only mode
> but doing dynamic DNS and a bit of authoritative DNS too.

One of the things we have been lacking in Scyld has been an external DNS 
service for compute nodes.  For cluster-internal name lookups we 
developed BeoNSS.

BeoNSS uses the linear address assignment of compute nodes to
calculate the name or IP address e.g. "Node23" is the IP address 
of Node0 + 23.  So BeoNSS depends on the assignment policy of 
the PXE server (1).
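
The linear mapping can be sketched in a few lines of Python.  This is not
BeoNSS code: the Node0 address 10.0.0.100, the function names and the
node-count limit are assumptions chosen for the example.

```python
import ipaddress
import re

NODE0 = ipaddress.IPv4Address("10.0.0.100")  # assumed address of Node0


def node_to_ip(name, node0=NODE0):
    """Map "NodeN" to the address of Node0 plus N."""
    m = re.fullmatch(r"[Nn]ode(\d+)", name)
    if m is None:
        raise ValueError(f"not a node name: {name!r}")
    return node0 + int(m.group(1))


def ip_to_node(addr, node0=NODE0, count=1000):
    """Map an address back to its node name; count is an assumed limit."""
    n = int(ipaddress.IPv4Address(addr)) - int(node0)
    if not 0 <= n < count:
        raise ValueError(f"{addr} is outside the node range")
    return f"Node{n}"
```

With the assumed base address, node_to_ip("Node23") gives 10.0.0.123 and
ip_to_node("10.0.0.123") gives "Node23".  No table lookup or network
traffic is involved, which is why the scheme only works if the PXE server
hands out addresses in node order.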

BeoNSS works great, especially when establishing all-to-all communication.  
But we failed to consider that external file and license servers might not 
be running Linux, and therefore couldn't use BeoNSS.  We now see that 
we need DNS and NIS (2) gateways for BeoNSS names.

(1) This leads to one of the many details that you have to get right.
The PXE server always assigns a temporary IP address to new nodes.  Once
a node has booted and passed tests, we then assign it a permanent node 
number and IP address.  Assigning short-lease IP addresses then changing a 
few seconds later requires tight, race-free integration with the DHCP 
server and ARP tables.  That's easy with a unified server, difficult with 
a script around ISC DHCP.
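
A minimal sketch of the two-stage assignment, assuming a single-process
server so that the lookup and the promotion can never interleave.  The
class NodeAddressing, both address ranges and both lease times are
hypothetical, and the ARP table update is left out entirely.

```python
import ipaddress


class NodeAddressing:
    """Hypothetical temporary-then-permanent address table, keyed by MAC."""

    def __init__(self, node0, temp_base, temp_lease=10, perm_lease=86400):
        self.node0 = ipaddress.IPv4Address(node0)          # address of Node0
        self.temp_base = ipaddress.IPv4Address(temp_base)  # temporary pool
        self.temp_lease = temp_lease    # assumed short lease, in seconds
        self.perm_lease = perm_lease    # assumed long lease, in seconds
        self.by_mac = {}                # mac -> ("temp" or "node", index)
        self.next_temp = 0
        self.next_node = 0

    def offer(self, mac):
        """Return (address, lease_seconds) to put in the DHCP reply."""
        entry = self.by_mac.get(mac)
        if entry is None:
            # A new node always starts with a temporary address.
            entry = ("temp", self.next_temp)
            self.next_temp += 1
            self.by_mac[mac] = entry
        kind, index = entry
        if kind == "temp":
            return self.temp_base + index, self.temp_lease
        return self.node0 + index, self.perm_lease

    def promote(self, mac):
        """Give a tested node its permanent node number."""
        kind, index = self.by_mac[mac]
        if kind == "node":
            return index
        number = self.next_node
        self.next_node += 1
        self.by_mac[mac] = ("node", number)
        return number
```

Because the temporary lease is short, the node's next renewal after
promote() picks up the permanent address.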

(2) We need NIS or NIS+ for netgroups.  Netgroups are used to export file 
systems to the cluster, independent of base IP address and size changes.

> Almost coincidentally, it's turned out to be useful for clusters too. I
> know from the dnsmasq mailing list that Joe Landman has used it in that
> way for a long time, and RLX used it in their control-tower product
> which has now been re-incarnated in HP's blade-management system.

I didn't know where it was used.  It does explain some of the Control 
Tower functionality.

> receives a UDP packet, computes a reply as a function of the input
> packet, the in-memory lease database and the current configuration, and
> synchronously sends the reply. The only time it even needs to allocate
> memory is when a new lease is created: everything else manages with a
> single packet buffer and a few statically-allocated data structures.
> This makes for great scalability.

You might consider breaking the synchronous reply aspect.  It's convenient 
because you can build the reply into the same packet buffer as the inbound 
request.  But it makes it difficult to defer responses.  (With DHCP you
can take the sleazy approach of "only respond when the elapsed-time is 
greater than X", at the risk of encountering PXE clients with short 
timeouts.)
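
For reference, the elapsed-time check could look like the sketch below.
The "secs" field is the 16-bit big-endian value at byte offset 8 of the
BOOTP/DHCP header; the function names and the threshold are assumptions
for the example.

```python
import struct


def dhcp_secs(packet):
    """Return the client's elapsed-time ("secs") field from a raw packet."""
    if len(packet) < 10:
        raise ValueError("too short to be a BOOTP/DHCP packet")
    # Header layout: op, htype, hlen, hops (1 byte each), xid (4), secs (2).
    (secs,) = struct.unpack_from("!H", packet, 8)
    return secs


def should_answer(packet, min_secs):
    """Answer only once the client says it has waited longer than min_secs."""
    return dhcp_secs(packet) > min_secs
```

A request that is not answered is simply dropped; the client's own
retransmission, with a larger secs value, gets the reply.  That reliance
on the client retrying is exactly where short PXE timeouts become a risk.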

> style, so I hope it will scale well too. I've already covered some of
> Don's checklist, and I'll pay attention to the rest of it, within the
> constraint that this has to be small and simple, to fit the primary, SOHO
> router, niche.

You probably won't want to go the whole way with the implementation, but 
hopefully I've given some useful suggestions.

> > As part of writing the server I wrote DHCP and TFTP clients to simulate 
> > high node count boots.  But the harshest test was old RLX systems: each of 
> > the 24 blades had three NICs, but could only boot off of the NIC 
> > connected to the internal 100base repeater/hub.  Plus the blade BIOS had a 
> > good selection of PXE bugs.
> 
> By chance, I have a couple of shelves of those available for testing.
> Would that be enough (48 blades, I guess) to get meaningful results?

Yes.  Better, try running the server on one of the blades, serving the 
other 47.  Have the blade do some disk I/O at the same time.  Transmeta 
CPUs were not the fastest chips around, even in their prime.

[[ Hmmm, did this posting come up to RGB standards of length+detail? ]]

-- 
Donald Becker				becker at scyld.com
Scyld Software	 			Scyld Beowulf cluster systems
914 Bay Ridge Road, Suite 220		www.scyld.com
Annapolis MD 21403			410-990-9993