[Beowulf] How Can Microsoft's HPC Server Succeed?

Sat Apr 19 12:26:28 PDT 2008

On Sat, 19 Apr 2008, Jan Heichler wrote:

> BC> But then I changed my mind when I started 
> BC> to hear what a great feature it is to have several nodes booting and 
> BC> installing the OS in the same 50 minutes (yes, minutes!) that a single
> BC> node takes, due to a wonderful feature called multicast. 
> 
> 50 minutes for a single node is of course unacceptable. 50 Minutes for 
> 256 nodes is okay i think.

With Scyld we do dynamic provisioning and "diskless administration", so 
all run-time elements come from a master with each boot.  That means we 
put a great deal of effort into making "installation" fast.

We have had releases that take 0.750 seconds before a compute node is 
ready to accept its first job.   Yes, that's under a second to start a 
kernel, activate the network, set up a connection to the cluster 
master, and configure an application environment.

(Of course that's ignoring the time in the BIOS counting memory and PXE's 
two second delay.  And that node hardware had no disks and few devices, 
the kernel was 2.4, which started faster, and every software 
setting was left at the default, e.g. no mounted filesystems or extra 
services.  A more typical start-up time is 5-15 seconds.)

But in my experience, even 10 seconds is too long for a scalable cluster 
architecture.

Why?  Because 100 or 1000 nodes is a really big number. Consider that an 
untuned master/boot-server spends about 20-25% of its focus on each node 
during those 10 seconds.  Or, just for approximation, we can boot 4-5 nodes 
at once.  That means it takes 3-5 minutes to bring 100 nodes up, and it's 
unbearably long to bring 1000 nodes up without auxiliary servers.
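
The arithmetic behind those figures can be sketched in a few lines of 
Python.  This is only a rough model: it assumes nodes boot in fixed 
"waves" of 4 or 5 at a time, each wave taking the full 10 seconds, and 
the function name and parameters are made up for illustration.

```python
import math

def boot_time_seconds(nodes, per_node_seconds=10.0, concurrent=4):
    """Rough wall-clock time to boot `nodes` from a single master.

    Assumes (for illustration only) that the master serves exactly
    `concurrent` nodes at a time and each batch takes `per_node_seconds`.
    """
    waves = math.ceil(nodes / concurrent)   # number of sequential batches
    return waves * per_node_seconds

if __name__ == "__main__":
    for nodes in (100, 1000):
        for concurrent in (4, 5):
            t = boot_time_seconds(nodes, concurrent=concurrent)
            print(f"{nodes} nodes, {concurrent} at once: {t / 60:.1f} min")
```

Under these assumptions 100 nodes take roughly 3-4 minutes and 1000 nodes 
take over half an hour, which is the scaling problem in a nutshell.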

That's why we redesigned part of our boot system when we started to take 
10 seconds.  And it's why I consider full installation to be unworkable 
for large clusters, especially when re-installation is considered to be 
part of cluster administration.

> But i doubt that it scales that well. Even Multicast packages get lost - 
> needs retransmission etc.

This pushed the hot button that really triggered my response.  We have 
multicast options in many parts of our system.  But they are always turned 
off.

Multicast is a parlor trick, like balancing 10 plates on your nose while 
riding a unicycle.  It makes for a great show, but you aren't served that 
way at your local diner.

We made the mistake once.. a single release.. never to be repeated.. of 
turning on multicast by default.  It worked for us, with small unmanaged 
switches in the test lab.  It broke when we sent the release to customers.
And it broke in a different way for each configuration.

The problem with multicast isn't how well it performs when it works, it's 
what happens when things go wrong.   Designing a multicast protocol 
requires knowing the characteristics of the transmission media.  What 
happens when a packet is lost?  Is it lost for everyone?  Do you lose 
packets one at a time or in bursts?  Do you lose them in small bursts or 
long bursts?  Are losses equally spaced or randomly distributed?
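
To see why the answers matter, here is a minimal sketch of the most 
optimistic case: every receiver loses each packet independently with the 
same probability.  The independence assumption, the 0.1% loss rate, and 
the function name are all hypothetical, chosen only for illustration.  
Real switches lose packets in bursts and patterns, which is exactly why 
a simple model like this one cannot be trusted for protocol design.

```python
def fraction_needing_resend(receivers, loss_rate):
    """Probability that at least one receiver misses a given packet.

    Assumes independent, identically distributed loss per receiver,
    which is an idealization and not how real switches behave.
    """
    return 1.0 - (1.0 - loss_rate) ** receivers

if __name__ == "__main__":
    # Hypothetical per-receiver loss rate of 0.1%.
    for receivers in (1, 16, 256, 1000):
        frac = fraction_needing_resend(receivers, 0.001)
        print(f"{receivers:5d} receivers: {frac:.1%} of packets missed by someone")
```

Even in this idealized model the share of packets that somebody missed 
grows quickly with the number of receivers, so the sender ends up 
handling retransmission for a large part of the transfer.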

Switches make different choices about discarding packets when overloaded.  
(About the only common choice they make is that multicast packets get 
tossed first.)  They change loss behavior as the load changes.  Cascading 
switches, including multiple switch chips in a single box, multiplies the 
number of ways they can behave.

The more you look at the problem, the more you understand that designing 
even simple multicast protocols is very similar to designing error 
correcting codes (a very difficult problem), and then we toss complexity 
on top of that.  Out of order packets, feedback-generated losses (where 
sending "please resend packet N" caused more loss), etc.

Hmmm, this is probably more readable if I stick a list in:

A few problems with multicast bulk transfer are:
   It slows the protocol to TFTP speed.
   It discards the ability to use TCP transmit offload hardware and
      software fast-path receive.
   Many switches prefer to drop multicast packets, presuming that it is
     low-priority traffic.  This is especially true when multicast is
     handled by software on the switch, with the embedded CPU sized
     assuming a typical environment with very little multicast traffic.
   Multicast packet drops may be pattern based, leaving some transfers
     persistently incomplete.
   Multicast filtering is imperfect on most host NICs, resulting in a
     hidden CPU cost and unpredictable performance on machines not
     participating in the transfer.

Why do some people think that multicast is a good idea?  It might be just 
a good example of lessons learned not being unlearned.  Using multicast was an 
excellent solution in the days of Ethernet coax and repeaters.  When using 
a repeater all machines see all traffic anyway, and older NICs treated all 
packets individually.  Today clusters use switched Ethernet or specialized 
interconnects (Infiniband, Myrinet, etc.), all of which must handle 
multicast packets as a special case and emulate the behavior.

-- 
Donald Becker				becker at scyld.com
Penguin Computing / Scyld Software
www.penguincomputing.com		www.scyld.com
Annapolis MD and San Francisco CA