[Beowulf] What services do you run on your cluster nodes?

Robert G. Brown rgb at phy.duke.edu
Tue Sep 23 07:41:00 PDT 2008


On Tue, 23 Sep 2008, John Hearns wrote:

> 2008/9/23 Robert G. Brown <rgb at phy.duke.edu>
>
>>
>> This meant that there could be hundreds or even thousands of machines
>> that saw every packet produced by every other machine on the LAN,
>> possibly after a few ethernet bridge hops.  This made conditions ripe
>> for what used to be called a "packet storm" (a term that has been
>> subverted as the name and trademark of a company, I see, but alas there
>> is no wikipedia article on same and even googled definitions seem
>> scarce, so it is apparently on its way to being a forgotten concept).
>>
>>
> Bob, the packet storm is not a forgotten concept. I've seen many a packet
> storm, and not that long ago.
> On Beowulf clusters. Just think what happens if your Spanning Tree protocol
> goes wonky.
> That's a reason why I'm no great lover of Ganglia too - it just sprays
> multicast packets all over your network.
> Which really should be OK - but if you have switches which don't perform
> well with multicast you get problems.

It isn't a forgotten concept as long as there are Old Guys still around,
for sure, but the switch DID all but eliminate "collisions" and the
timing problems that led to the worst pathology, and it also dropped the
CPU load associated with global network traffic from "tightly coupled"
to "minimally coupled".  Multicast/broadcast traffic has always been a
problem (remember DECnet and AppleTalk?  They couldn't tie their own
shoes without broadcasting to the entire network, so a storm didn't NEED
external nucleation; being on one of those networks beyond a certain
size was sort of like living in a perpetual storm:-) but it is much
BETTER than it used to be, at least.

I don't know why ganglia uses multicasts.  Old Guys also remember
rwhod, a very early daemon that Sun shipped as part of the "network is
the computer" thing they had going.  Every few seconds it would wake up
and broadcast more or less what one got out of the "uptime" command, and
on any client anyone could enter the "ruptime" command and basically get
uptime from the entire LAN (inside the broadcast radius of the nearest
broadcast-blocking device).
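
For anybody who never saw it in action, here's a rough sketch of the
rwhod pattern in modern dress -- every host sprays its load average at
the LAN broadcast address and every host that cares listens.  The port
number and message format below are made up for illustration; the real
rwhod spoke its own binary format on the "who" UDP port.

    # Minimal sketch of the rwhod-style broadcast pattern, NOT the real
    # rwhod wire format.  Run broadcaster() on every host and listener()
    # anywhere you want a crude "ruptime".  Unix-only (os.getloadavg).
    import os
    import socket
    import time

    PORT = 9999   # hypothetical port, chosen for illustration only

    def broadcaster(interval=5):
        """Every few seconds, broadcast this host's load to the LAN."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        while True:
            load1, load5, load15 = os.getloadavg()
            msg = "%s up, load %.2f %.2f %.2f" % (
                socket.gethostname(), load1, load5, load15)
            sock.sendto(msg.encode(), ("<broadcast>", PORT))
            time.sleep(interval)

    def listener():
        """Collect broadcasts from every host in the broadcast domain."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", PORT))
        while True:
            data, (addr, _port) = sock.recvfrom(1024)
            print("%s: %s" % (addr, data.decode()))

Note that every host on the flat LAN receives (and has to at least
inspect) every other host's packet -- exactly the N-squared traffic
pattern that didn't scale.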

This was fine for tiny LANs with network isolation.  For big flat LANs
with few bridges, lots of hosts, and forwarding of broadcasts (necessary
if servers spanned the LAN), well, let's just say it didn't scale and
leave it at that.  So it isn't like people haven't KNOWN better than to
lean on broadcasts; they've known it simply forever.

Besides, the nodes in most clusters -- or clients in most LANs -- don't
all need to know what the other nodes/clients/servers are doing.
Management and monitoring is intrinsically master/slave-like -- one host
(the one I'm sitting at) wants to access all the information from the
nodes/clients/servers.  It is intrinsically a serial bottleneck.
Persistent network connections and round-robin polling will
intrinsically optimize at least PART of the bottleneck associated with
gathering the information.  Dumping out multicasts doesn't mean that the
toplevel monitoring host isn't serially bottlenecked; it only means it
has no control over what gets sent, over collisions, or over when it
handles incoming information.  If there were FIFTY hosts each needing
information about all the others, multicasts would be good, but when
there is just one they seem like a bad choice.
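
For contrast, the master/slave pattern looks something like the sketch
below, where the monitoring host holds one persistent TCP connection per
node and polls them round robin, so it decides exactly when data
arrives.  The node names, port, and "status" request string are purely
hypothetical -- this is NOT xmlsysd's actual protocol, just the shape of
the idea.

    # Sketch of master/slave monitoring with persistent connections and
    # round-robin polling.  Assumes each node runs some daemon on PORT
    # that answers a one-line request with a one-line reply (hypothetical).
    import socket
    import time

    NODES = ["node01", "node02", "node03"]   # hypothetical node names
    PORT = 7887                              # hypothetical daemon port

    def connect_all(nodes):
        """Open one persistent connection per node, once, up front."""
        conns = {}
        for host in nodes:
            sock = socket.create_connection((host, PORT))
            conns[host] = sock.makefile("rw")
        return conns

    def poll_round_robin(conns, interval=5):
        """Ask each node in turn for its status; the master sets the pace."""
        while True:
            for host, f in conns.items():
                f.write("status\n")
                f.flush()
                print(host, f.readline().strip())
            time.sleep(interval)

    if __name__ == "__main__":
        poll_round_robin(connect_all(NODES))

The point isn't the code, it's that all the traffic is point-to-point
and the toplevel host controls when it arrives.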

Truthfully, one thing I learned writing xmlsysd is that a monitoring
system for a cluster IS a parallel application.  Different IPC designs
have very different scaling.  Master-slave is ideal for certain regimes.
Multicast or tree structures might well scale better for other regimes.
I wrote xmlsysd to work well for relatively small clusters -- out to
somewhere over 100 nodes -- or most LANs.  I have no idea how well it
would work at 1024 nodes -- maybe terribly.  And there are still a few
small flaws in it that I'd like to work on one day, and probably will if
anybody other than me and the four or five other people I know of
starts using it...;-) But nothing show-stopping that I know of -- I
use it routinely over days of continuous monitoring and it seems to work
just fine.
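
To put a rough number on the "regimes" remark: flat master/slave polling
means one host talks to all N nodes, while a k-ary tree of aggregators
caps any single host's fan-in at k, at the price of log_k(N) levels of
latency.  A back-of-the-envelope sketch (numbers purely illustrative):

    # Fan-in for flat polling vs. a k-ary aggregation tree (illustrative).
    import math

    def flat_fanin(n_nodes):
        return n_nodes                          # master talks to everyone

    def tree_levels(n_nodes, k):
        return math.ceil(math.log(n_nodes, k))  # hops from leaf to top

    for n in (100, 1024, 10000):
        print("%6d nodes: flat fan-in %d, 16-ary tree fan-in 16 over %d level(s)"
              % (n, flat_fanin(n), tree_levels(n, 16)))

Somewhere between those two regimes is where a flat master/slave design
stops being the obvious choice.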

   rgb

-- 
Robert G. Brown                            Phone(cell): 1-919-280-8443
Duke University Physics Dept, Box 90305
Durham, N.C. 27708-0305
Web: http://www.phy.duke.edu/~rgb
Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977


