[Beowulf] What services do you run on your cluster nodes?
becker at scyld.com
Wed Sep 24 11:59:23 PDT 2008
On Wed, 24 Sep 2008, Robert G. Brown wrote:
> On Tue, 23 Sep 2008, Donald Becker wrote:
> >> XML is (IMO) good, not bad.
> > I have so much to write on this topic, I'll take the first pot shot at RGB
> > ;-)
> > XML is evil. Well, evil for this.
> Oh, it's fine. I've gone the rounds on this one with Linus Himself (who
> agrees with you, BTW:-).
And he's smart and good looking... we have some things in common.
Especially the good looking part.
> I'd argue -- persuasively I think -- that the evil was not in XML per se
> but in various other aspects of ganglia such as its flexibility and
> ability to absorb and report back "any" statistic.
That's part of the evil. A cluster stat system shouldn't be completely
flexible. There is a core set of things that it must report, and it must
report them regularly. If they're not required, you can't depend upon them.
That makes the end tools difficult or impossible to write. They end up
using their own reporting system.
> Now, Linus did argue equally (or even more:-) persuasively that the
> overhead associated with converting /proc itself to use XML per se was
> too great and sure, he's probably correct.
The way the Linux kernel reports values is pretty much a tangent. The
values we need to report are mostly independent of the kernel.
Yes, it's been a pain to re-work the BeoStat system for the 2.2, 2.4 and
2.6 kernels. But XML wouldn't have changed that. People would still have
changed what the numbers mean, perhaps even more freely with the mistaken
belief that XML would magically allow old tools to understand very
different meanings for the numbers, e.g. free memory.
> right data objects are and how they relate to other objects and are most
> naturally arranged. The data view thus produced is self-documenting to
> the extent sensible tag names are chosen, manifestly hierarchical (it
Hmmm, this is a hot-button issue. The values are not self-documenting.
Ganglia is a good example. There are values that it reports just because
the kernel reports them. And other values that you have to read the code
to understand. Values like "load_one" and "proc_total". What is
"proc_total"? What can I do with it? (The answers: it's approximately
"ps x | wc", and it's mostly useless since it doesn't eliminate
baseline system processes.)
> Usually, the best possible tradeoff is one that doesn't, one that yields
> the best of both worlds. For example, XML would be just great if it
> were possible to construct an a priori tag dictionary, create a map from
> tags to information-theoretically minimal binary (where every tag in
> most problems could be reduced to one or at most two bytes in length),
> and install the DICTIONARY on both sender and receiver ends.
> This would
> effectively compress the actual XML to where the overhead associated
> with sending the message is down to perhaps 10-40% of the raw binary
> message, not an unreasonable price to pay for trivial library-based
Ahhh, you are describing a different class of system than XML. Something
closer to AML, although reading the documentation for AML will hurt your
When I first heard about XML I assumed that it was this type of system --
one that described how to automatically decode tightly-packed objects.
Something that you would have in a file header, before a million records.
Or send once at the beginning of a six month connection that handles
millions of cluster stat packets.
That would be exactly the right kind of system here. Except there isn't
one in common use. So we have had to manually write one-off structure
encoders and decoders. It's not a big deal, but it could/should be easier
and better. The only down-side is that we can't fire up WireShark and
understand the UDP packet contents.
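
To make the "describe once, then send packed records" idea concrete, here is a minimal sketch in Python. The field names, the layout and the header format are all hypothetical, chosen only for illustration; this is not BeoStat's actual wire format. The header carries the layout a single time, and every later record is just the packed values.

```python
import struct

# Hypothetical record layout, NOT the real BeoStat format.
# "!" = network byte order, no padding:
#   node id (uint32), timestamp in seconds (uint32),
#   one-minute load average x100 (uint16), CPU utilization percent (uint16),
#   free memory in KiB (uint64).
STAT_FORMAT = "!IIHHQ"
STAT_FIELDS = ("node", "timestamp", "load1_x100", "cpu_pct", "mem_free_kib")


def make_header():
    """Describe the layout once: sent at connection start or as a file header."""
    text = STAT_FORMAT + "\n" + ",".join(STAT_FIELDS)
    return text.encode("ascii")


def encode_stat(record):
    """Pack one record into its fixed-size binary form."""
    return struct.pack(STAT_FORMAT, *(record[name] for name in STAT_FIELDS))


def decoder_from_header(header):
    """Build a decoder from the one-time header; no per-record tags are needed."""
    fmt, names = header.decode("ascii").split("\n")
    fields = names.split(",")
    layout = struct.Struct(fmt)

    def decode(payload):
        # struct.error is raised if the payload size does not match the layout.
        return dict(zip(fields, layout.unpack(payload)))

    return decode
```

With this layout each record is 20 bytes on the wire, and the only per-stream overhead is the short header. The price is the one named above: a generic packet sniffer sees opaque bytes unless it is taught the layout.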
> xmlsysd is -- I think -- very nicely hierarchically organized. It
> achieves adequate efficiency for many uses a different way -- it is only
> called on (client side) demand, so the network isn't cluttered with
> unwanted or unneeded casts (uni, multi, broad). It is "throttleable" --
> the client controls the daemon on each host, and can basically tell it
The very first implementation of what later became BeoStat did this.
There was only a single client, a display GUI, and it told the nodes how
and how often to report. That turned out to be a bad design. No other
tools could rely upon the reporting stream contents, and it couldn't be
used for liveness indication. Once we redesigned to a send-only
system that reported once per second it became generally useful.
For a concrete example: the GUI might not care about CPU utilization
percentages and not gather the numbers. But when we run an MPI process
mapper (a mapper is a one-off, at-this-instant layout scheduler) we don't
want to wait while those numbers are requested and reported. We want to
support both continuously-running (GUI display tools) and start-exit
(extract data, analyze, report) usage.
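
A minimal sketch of the send-only side, in Python. The names `collect_load` and `report`, the payload contents and the one-second default are illustrative assumptions, not the BeoStat implementation. The point is that the node never listens for requests: it sends the same set of values on a fixed period, so any consumer can rely on the stream and use its arrival as a liveness signal.

```python
import os
import struct
import time


def collect_load():
    """Hypothetical payload: wall-clock time and one-minute load average."""
    load1, _, _ = os.getloadavg()  # available on Unix-like systems
    return struct.pack("!dd", time.time(), load1)


def report(sock, dest, collect=collect_load, period=1.0, count=None):
    """Send one datagram every `period` seconds; never wait for a request.

    `sock` is a UDP socket (socket.SOCK_DGRAM); `dest` is an (address, port)
    pair. `count=None` means run until the process is stopped.
    """
    sent = 0
    while count is None or sent < count:
        sock.sendto(collect(), dest)
        sent += 1
        if count is None or sent < count:
            time.sleep(period)
    return sent
```

Because the sender is unconditional, a start-exit tool such as a mapper can read whatever arrived most recently instead of asking for numbers and waiting.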
> Well, both of them have to be sent by the network. One can choose UDP
> or TCP for either one, and each has advantages and disadvantages (with
> -- gulp -- the same tradeoffs between reliability and robustness and
There are trade-offs between UDP and TCP. But UDP is the right model for
stats. I pretty much don't care about old numbers. I'll go further -- I
want to forget old numbers. If there is a problem, communication or
processing, that blocks updates for a minute, I want the numbers to
look stale. And I would rather get the new stats right away than process
and post a series of old messages.
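
The "forget old numbers" behaviour can be sketched on the receiving side like this, in Python. The class name, the three-second staleness threshold and the 2048-byte read size are assumptions made for illustration. Each new datagram overwrites the previous one for that node, and a node with no recent update simply looks stale.

```python
import time

STALE_AFTER = 3.0  # assumed threshold: seconds without an update


class StatTable:
    """Keeps only the most recent report per node; older data is overwritten."""

    def __init__(self, stale_after=STALE_AFTER):
        self.stale_after = stale_after
        self.latest = {}  # node address -> (arrival time, payload)

    def update(self, node, payload, now):
        self.latest[node] = (now, payload)

    def is_stale(self, node, now):
        entry = self.latest.get(node)
        return entry is None or now - entry[0] > self.stale_after


def drain(sock, table):
    """Read every datagram already queued on a UDP socket.

    Only the newest payload per sender survives in the table; nothing is
    replayed or queued for later processing.
    """
    sock.setblocking(False)
    count = 0
    while True:
        try:
            payload, addr = sock.recvfrom(2048)
        except BlockingIOError:
            return count  # queue is empty
        table.update(addr[0], payload, time.monotonic())
        count += 1
```

With TCP, a stalled consumer would instead have to work through the backlog in order before reaching the current values.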
> Actually, from here on down -- with the exceptions of choosing to
> use xml to encapsulate the return, TCP instead of UDP, and allowing the
> client side to control and throttle the daemon so that one can tune the
> impact of the monitoring to the demands of the cluster and task, the two
> things sound very similar -- as they should be, given that they're both
> N>3 generation tools.
The bottom line for all of this isn't "mine is better than yours". I
would like to see a common cluster state/status/statistics reporting
system. It doesn't have to look exactly like BeoStat, but I expect a good
one wouldn't be too far from the current BeoStat design.
Donald Becker becker at scyld.com
Penguin Computing / Scyld Software
Annapolis MD and San Francisco CA