[Beowulf] What services do you run on your cluster nodes?

Wed Sep 24 11:59:23 PDT 2008

On Wed, 24 Sep 2008, Robert G. Brown wrote:

> On Tue, 23 Sep 2008, Donald Becker wrote:
> 
> >> XML is (IMO) good, not bad.
> >
> > I have so much to write on this topic, I'll take the first pot shot at RGB
> > ;-)
> >
> > XML is evil.  Well, evil for this.
> 
> Oh, it's fine.  I've gone the rounds on this one with Linus Himself (who
> agrees with you, BTW:-).

And he's smart and good looking... we have some things in common.  
Especially the good looking part.

> I'd argue -- persuasively I think -- that the evil was not in XML per se
> but in various other aspects of ganglia such as its flexibility and
> ability to absorb and report back "any" statistic.

That's part of the evil.  A cluster stat system shouldn't be completely 
flexible.  There is a core set of things that it must report, and it must 
report them regularly.  If a value isn't required, you can't depend upon it.  
That makes the end tools difficult or impossible to write.  They end up 
using their own reporting system.

> Now, Linus did argue equally (or even more:-) persuasively that the
> overhead associated with converting /proc itself to use XML per se was
> too great and sure, he's probably correct.

The way the Linux kernel reports values is pretty much a tangent.  The 
values we need to report are mostly independent of the kernel 
details.

Yes, it's been a pain to re-work the BeoStat system for the 2.2, 2.4 and 
2.6 kernels.  But XML wouldn't have changed that.  People would still have 
changed what the numbers mean, perhaps even more freely with the mistaken 
belief that XML would magically allow old tools to understand very 
different meanings for the numbers e.g. free memory.

> right data objects are and how they relate to other objects and are most
> naturally arranged.  The data view thus produced is self-documenting to
> the extent sensible tag names are chosen, manifestly hierarchical (it

Hmmm, this is a hot-button issue.  The values are not self-documenting.  
Ganglia is a good example.  There are values that it reports just because
the kernel reports them.  And other values that you have to read the code
to understand.  Values like "load_one" and "proc_total".  What is
"proc_total"?  What can I do with it?  (The answers: it's approximately
"ps x | wc", and it's mostly useless since it doesn't eliminate 
baseline system processes.)

> Usually, the best possible tradeoff is one that doesn't, one that yields
> the best of both worlds.  For example, XML would be just great if it
> were possible to construct an a priori tag dictionary, create a map from
> tags to information-theoretically minimal binary (where every tag in
> most problems could be reduced to one or at most two bytes in length),
> and install the DICTIONARY on both sender and receiver ends.
...
> This would
> effectively compress the actual XML to where the overhead associated
> with sending the message is down to perhaps 10-40% of the raw binary
> message, not an unreasonable price to pay for trivial library-based

Ahhh, you are describing a different class of system than XML.  Something 
closer to AML, although reading the documentation for AML will hurt your 
brain.

When I first heard about XML I assumed that it was this type of system --
one that described how to automatically decode tightly-packed objects.  
Something that you would have in a file header, before a million records.  
Or send once at the beginning of a six month connection that handles
millions of cluster stat packets.

That would be exactly the right kind of system here.  Except there isn't 
one in common use.  So we have had to manually write one-off structure 
encoders and decoders.  It's not a big deal, but it could/should be easier 
and better.  The only down-side is that we can't fire up WireShark and 
understand the UDP packet contents.
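
A minimal sketch of what such a one-off structure encoder and decoder can
look like, using Python's struct module.  The field names, sizes and the
version number are assumptions made up for illustration; they are not the
BeoStat wire format.

```python
import struct

# Hypothetical fixed layout, network byte order, no padding.
# None of these field names or sizes come from BeoStat.
#   H  layout version
#   H  node id
#   I  sequence number
#   I  free memory, KiB
#   H  one-minute load average, scaled by 100
#   B  CPU utilization, percent
STAT_FORMAT = "!HHIIHB"
STAT_SIZE = struct.calcsize(STAT_FORMAT)
LAYOUT_VERSION = 1


def encode_stat(node_id, seq, mem_free_kib, load_1min, cpu_pct):
    """Pack one sample into a fixed-size binary record."""
    return struct.pack(
        STAT_FORMAT,
        LAYOUT_VERSION,
        node_id,
        seq,
        mem_free_kib,
        round(load_1min * 100),
        cpu_pct,
    )


def decode_stat(packet):
    """Unpack a record; reject anything that is not this exact layout."""
    if len(packet) != STAT_SIZE:
        raise ValueError("unexpected packet size")
    version, node_id, seq, mem_free_kib, load_x100, cpu_pct = struct.unpack(
        STAT_FORMAT, packet
    )
    if version != LAYOUT_VERSION:
        raise ValueError("unknown layout version")
    return {
        "node_id": node_id,
        "seq": seq,
        "mem_free_kib": mem_free_kib,
        "load_1min": load_x100 / 100,
        "cpu_pct": cpu_pct,
    }
```

The meaning of each field lives only in the code on both ends, which is
why a generic packet sniffer can't label the contents.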

> xmlsysd is -- I think -- very nicely hierarchically organized.  It
> achieves adequate efficiency for many uses a different way -- it is only
> called on (client side) demand, so the network isn't cluttered with
> unwanted or unneeded casts (uni, multi, broad).  It is "throttleable" --
> the client controls the daemon on each host, and can basically tell it

The very first implementation of what later became BeoStat did this.  
There was only a single client, a display GUI, and it told the nodes how
and how often to report.  That turned out to be a bad design.  No other
tools could rely upon the reporting stream contents, and it couldn't be
used for liveness indication.  Once we redesigned to a send-only 
system that reported once per second it became generally useful.
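
A rough sketch of the send-only idea, with liveness falling out of the
arrival times.  The function names, the three-second timeout and the
make_packet callback are assumptions for illustration, not BeoStat code.

```python
import time


def report_loop(sock, dest, make_packet, count, interval=1.0, sleep=time.sleep):
    """Send `count` stat datagrams to `dest`, one per `interval` seconds.

    The sender never waits for a request or an acknowledgement.
    `sock` needs only a sendto(bytes, address) method, e.g. a
    socket.socket(AF_INET, SOCK_DGRAM).
    """
    for seq in range(count):
        sock.sendto(make_packet(seq), dest)
        sleep(interval)


def live_nodes(last_heard, now, timeout=3.0):
    """Return the nodes whose most recent packet is at most `timeout` old.

    `last_heard` maps node id -> arrival time of its latest packet.
    """
    return {node for node, t in last_heard.items() if now - t <= timeout}
```

Because every node reports on a fixed schedule whether or not anyone asked,
any listener can treat a missing report as a sign of trouble.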

For a concrete example: the GUI might not care about CPU utilization
percentages and not gather the numbers.  But when we run an MPI process
mapper (a mapper is a one-off, at-this-instant layout scheduler) we don't
want to wait while those numbers are requested and reported.  We want to
support both continuously-running (GUI display tools) and start-exit
(extract data, analyze, report) usage.
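
As an illustration of the start-exit case, here is a toy mapper that works
purely from numbers already on hand.  The least-loaded-first policy and the
snapshot layout are assumptions for the sketch; a real mapper would weigh
much more than CPU utilization.

```python
def map_processes(snapshot, nprocs):
    """Pick nodes for `nprocs` processes, least-loaded first.

    `snapshot` maps node name -> CPU utilization percent, as already
    collected from the regular reports; no node is queried here.
    """
    if not snapshot:
        raise ValueError("no nodes reporting")
    # Sort by utilization, breaking ties by name for a stable result.
    ranked = sorted(snapshot, key=lambda node: (snapshot[node], node))
    # Wrap around if there are more processes than nodes.
    return [ranked[i % len(ranked)] for i in range(nprocs)]
```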

> Well, both of them have to be sent by the network.  One can choose UDP
> or TCP for either one, and each has advantages and disadvantages (with
> -- gulp -- the same tradeoffs between reliability and robustness and
> speed).

There are trade-offs between UDP and TCP.  But UDP is the right model for
stats.  I pretty much don't care about old numbers.  I'll go further -- I
want to forget old numbers.  If there is a problem, communication or
processing, that blocks updates for a minute, I want the numbers to 
look stale.  And I would rather get the new stats right away than process 
and post a series of old messages.
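
A small sketch of that receiver policy: newest sample wins, older ones are
dropped unprocessed, and the age is exposed so a display can mark the
numbers stale.  The table layout, the function names and the three-second
threshold are assumptions for illustration.

```python
def apply_backlog(table, backlog, now):
    """Fold a backlog of received samples into `table`, newest wins.

    `table` maps node id -> (seq, values, arrival_time).
    `backlog` is an iterable of (node_id, seq, values) tuples.
    Samples with a sequence number not above the stored one are dropped.
    Sequence wraparound is ignored in this sketch.
    """
    for node_id, seq, values in backlog:
        current = table.get(node_id)
        if current is None or seq > current[0]:
            table[node_id] = (seq, values, now)
    return table


def view(table, node_id, now, stale_after=3.0):
    """Return (values, age_seconds, is_stale) for a node, or None if unseen."""
    entry = table.get(node_id)
    if entry is None:
        return None
    _, values, arrived = entry
    age = now - arrived
    return values, age, age > stale_after
```

Nothing is retransmitted or queued, so a lost datagram costs one sample
and the next report replaces it.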

> Actually, from here on down -- with the exceptions of choosing to
> use xml to encapsulate the return, TCP instead of UDP, and allowing the
> client side to control and throttle the daemon so that one can tune the
> impact of the monitoring to the demands of the cluster and task, the two
> things sound very similar -- as they should be, given that they're both
> N>3 generation tools.

The bottom line for all of this isn't "mine is better than yours".  I 
would like to see a common cluster state/status/statistics reporting 
system.  It doesn't have to look exactly like BeoStat, but I expect a good 
one wouldn't be too far from the current BeoStat design.

-- 
Donald Becker				becker at scyld.com
Penguin Computing / Scyld Software
www.penguincomputing.com		www.scyld.com
Annapolis MD and San Francisco CA