[Beowulf] How Can Microsoft's HPC Server Succeed?
becker at scyld.com
Sat Apr 19 12:26:28 PDT 2008
On Sat, 19 Apr 2008, Jan Heichler wrote:
> BC> But then I changed my mind when I started
> BC> to hear what a great feature it is to have several nodes booting and
> BC> installing the OS in the same 50 minutes (yes, minutes!) that a single
> BC> node takes, due to a wonderful feature called multicast.
> 50 minutes for a single node is of course unacceptable. 50 Minutes for
> 256 nodes is okay i think.
With Scyld we do dynamic provisioning and "diskless administration", so
all run-time elements come from a master with each boot. That means we
put a great deal of effort into making "installation" fast.
We have had releases that take 0.750 seconds before a compute node is
ready to accept its first job. Yes, that's under a second to start a
kernel, activate the network, set up a connection to the cluster
master, and configure an application environment.
(Of course that's ignoring the time in the BIOS counting memory and PXE's
two second delay. And that node hardware had no disks and few devices,
the kernel was 2.4 which started faster, and every software setting was
left at the default, e.g. no mounted filesystems or extra services. A
more typical start-up time is 5-15 seconds.)
But in my experience, even 10 seconds is too long for a scalable cluster.
Why? Because 100 or 1000 nodes is a really big number. Consider that an
untuned master/boot-server spends about 20-25% of its attention on that node
during the 10 seconds. Or, just for approximation, we can boot 4-5 nodes
at once. That means it takes 3-5 minutes to bring 100 nodes up, and it's
unbearably long to bring 1000 nodes up without auxiliary servers.
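The arithmetic above can be sketched quickly (my own back-of-envelope
numbers, using the ~10 second per-node figure and 4-5 concurrent boots):

```python
# Back-of-envelope boot-throughput estimate (my sketch, using the
# figures above: ~10 s of boot-server attention per node, and roughly
# 4-5 nodes served concurrently by one untuned boot server).
def bringup_minutes(nodes, per_node_s=10.0, concurrency=4):
    """Wall-clock minutes to boot `nodes` from one untuned boot server."""
    return nodes * per_node_s / concurrency / 60.0

print(round(bringup_minutes(100), 1))   # a few minutes for 100 nodes
print(round(bringup_minutes(1000), 1))  # the better part of an hour for 1000
```

With 4-way concurrency the 100-node case lands in the 3-5 minute range
quoted above, and 1000 nodes is clearly unworkable without extra servers.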
That's why we redesigned part of our boot system when node start-up grew
to 10 seconds. And it's why I consider full installation to be unworkable
for large clusters, especially when re-installation is considered to be
part of cluster administration.
> But i doubt that it scales that well. Even Multicast packages get lost -
> needs retransmission etc.
This pushed the hot button that really triggered my response. We have
multicast options in many parts of our system. But they are always turned
off by default.
Multicast is a parlor trick, like balancing 10 plates on your nose while
riding a unicycle. It makes for a great show, but you aren't served that
way at your local diner.
We made the mistake once... a single release... never to be repeated... of
turning on multicast by default. It worked for us, with small unmanaged
switches in the test lab. It broke when we sent the release to customers.
And it broke in a different way for each configuration.
The problem with multicast isn't how well it performs when it works, it's
what happens when things go wrong. Designing a multicast protocol
requires knowing the characteristics of the transmission media. What
happens when a packet is lost? Is it lost for everyone? Do you lose
packets one at a time or in bursts? Do you lose them in small bursts or
long bursts? Are losses equally spaced or randomly distributed?
Switches make different choices about discarding packets when overloaded.
(About the only common choice they make is that multicast packets get
tossed first.) They change loss behavior as the load changes. Cascading
switches, including multiple switch chips in a single box, multiplies the
number of ways they can behave.
The more you look at the problem, the more you understand that designing
even simple multicast protocols is very similar to designing error
correcting codes (a very difficult problem), and then we toss complexity
on top of that. Out of order packets, feedback-generated losses (where
sending "please resend packet N" caused more loss), etc.
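A toy simulation (entirely my own sketch, not Scyld code) of that
feedback effect, where each round's resend requests add load and push
the loss rate up:

```python
import random

# Toy model of feedback-generated loss: each receiver still missing a
# packet sends a "please resend" NACK, and those control packets add
# congestion, modeled here as extra loss proportional to the NACK count.
# All parameters are illustrative, not measurements.
def rounds_to_deliver(receivers=1000, base_loss=0.01,
                      nack_penalty=1e-4, seed=42):
    rng = random.Random(seed)
    missing, rounds = receivers, 0
    while missing and rounds < 10_000:
        # Loss rate this round grows with last round's NACK volume.
        loss = min(0.9, base_loss + nack_penalty * missing)
        missing = sum(1 for _ in range(missing) if rng.random() < loss)
        rounds += 1
    return rounds
```

Even this crude model shows the shape of the problem: the more receivers
miss a packet, the more retransmission traffic they generate, which is
exactly the traffic most likely to be dropped next.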
Hmmm, this is probably more readable if I stick a list in:
A few problems with multicast bulk transfer are:

 - It slows the protocol to TFTP speed.
 - It discards the ability to use TCP transmit offload hardware and
   software fast-path receive.
 - Many switches prefer to drop multicast packets, presuming that it is
   low-priority traffic. This is especially true when multicast is
   handled by software on the switch, with the embedded CPU sized
   assuming a typical environment with very little multicast traffic.
 - Multicast packet drops may be pattern based, leaving some transfers
   repeatedly failing while others complete.
 - Multicast filtering is imperfect on most host NICs, resulting in a
   hidden CPU cost and unpredictable performance on machines not
   participating in the transfer.
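For concreteness, here is what multicast setup looks like with the
standard IPv4 socket API (the group address and port are arbitrary
examples of mine). The IP_ADD_MEMBERSHIP join is what programs the NIC's
multicast filter, and it is exactly that filter which is imperfect on
many NICs:

```python
import socket
import struct

# Arbitrary example group/port for illustration only.
GROUP, PORT = "239.255.0.1", 5007

def make_sender(ttl=1):
    # Sender side: a plain UDP socket; TTL=1 keeps traffic on the local net.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return s

def make_receiver():
    # Receiver side: bind the port, then join the group. The join is what
    # asks the NIC to accept this group's frames -- often via an imperfect
    # hash-based hardware filter, so other hosts may see the traffic too.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("", PORT))
    mreq = struct.pack("4s4s", socket.inet_aton(GROUP),
                       socket.inet_aton("0.0.0.0"))
    s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return s
```

Nothing in that setup gives you reliability: loss detection,
retransmission, and flow control all have to be layered on top, which is
where the protocol-design problems above come in.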
Why do some people think that multicast is a good idea? It might be just
a good example of lessons learned not being unlearned. Using multicast was an
excellent solution in the days of Ethernet coax and repeaters. When using
a repeater all machines see all traffic anyway, and older NICs treated all
packets individually. Today clusters use switched Ethernet or specialized
interconnects (Infiniband, Myrinet, etc.), all of which must handle
multicast packets as a special case and emulate the behavior.
Donald Becker becker at scyld.com
Penguin Computing / Scyld Software
Annapolis MD and San Francisco CA