[Beowulf] What services do you run on your cluster nodes?

Thu Sep 25 11:53:14 PDT 2008

On Wed, 24 Sep 2008, Donald Becker wrote:

>> xmlsysd is -- I think -- very nicely hierarchically organized.  It
>> achieves adequate efficiency for many uses a different way -- it is only
>> called on (client side) demand, so the network isn't cluttered with
>> unwanted or unneeded casts (uni, multi, broad).  It is "throttleable" --
>> the client controls the daemon on each host, and can basically tell it
>
> The very first implementation of what later became BeoStat did this.
> There was only a single client, a display GUI, and it told the nodes how
> and how often to report.  That turned out to be a bad design.  No other
> tools could rely upon the reporting stream contents, and it couldn't be
> used for liveness indication.  Once we redesigned to a send-only
> system that reported once per second it became generally useful.

I opted to split off a library that facilitates the building of UIs, as
well as a tool that basically unpacks certain displays (predefined
clusters of reported stats) and dumps them tablewise to stdout.  So
anybody can get to the stream, parse it, and use it, even if they only
know how to use split in perl and are clueless about the XML tools that
exist for pretty much any possible programming environment.  In C you
can just clone the non-curses part of wulfstat and hack it to fit, and it
is mostly just a set of library calls.  I wouldn't say my split is very
good yet -- I've learned the hard way with dieharder (and libdieharder,
which does all the work) that getting the library "just right" to support
multiple UIs is not easy and not likely to be ideal the first couple of
tries.  But on the next rewrite it should be PRETTY good.
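
As a rough illustration of how little is needed to consume such a table
dump, here is a minimal Python sketch.  The column names, the "#" header
convention and the "down" marker are all made up for the example; the
real dump format may well differ.

```python
# Hypothetical table dump: the column names and layout below are invented
# for illustration and are not the real output format of any tool.
SAMPLE = """\
# host    load1  load5  mem_free_kb
node01    0.12   0.10   1843200
node02    3.97   3.80   204800
node03    down   down   down
"""


def parse_table(text):
    """Turn a whitespace-separated table into a list of per-host dicts."""
    rows = []
    columns = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # The header line names the columns.
            columns = line.lstrip("#").split()
            continue
        if columns is None:
            raise ValueError("data row seen before header line")
        rows.append(dict(zip(columns, line.split())))
    return rows


def busiest(rows, column="load1"):
    """Return the name of the live host with the highest value in `column`."""
    live = [r for r in rows if r[column] != "down"]
    return max(live, key=lambda r: float(r[column]))["host"]


if __name__ == "__main__":
    print(busiest(parse_table(SAMPLE)))
```

Everything stays a string until a caller asks for a number, which is
about as much machinery as "split in perl" would give you.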

The liveness issue is most definitely a problem with xmlsysd/wulfstat,
because frankly TCP sucks for this specific purpose.  I'd love to have
what amounts to ping built into the UI, but it is a restricted socket
command and I don't want to make the UI suid root.

It would be trivial to recode xmlsysd to work the other way -- in fact,
not too difficult to make it work BOTH ways: allow an initial
straight-up TCP connection, configure the daemon to your desired set of
statistics (all existing features) and then add two new commands to tell
it to close the connection and begin to unicast back to host X, with
frequency Y.  Or better yet, leave the TCP connection open as a control
interface WHILE it casts back at Y, permitting the controlling host to
send synchronization feedback and drive its gathering of messages
towards "simultaneity" in some narrowly confined sub-interval of time
period Y, or to reconfigure the message on the fly, or to request
out-of-band a "snapshot" of information accessible to the daemon but not
in the regular message.  I actually really like this idea -- the
live TCP connection actually enables lots of things, all controlled from
the client/master node side, while STILL obtaining the benefit of
unicast and UDP.

That would partly resolve the ping issue and perhaps let me make
identifying a downed host, and reconnecting to it as it comes back up, a
bit more robust -- not issues in your design but issues in mine where I have to
cope with TCP timeouts and so on to decide when something is down, which
can lead to poor performance on the client UI side (the nodes don't
care).

> For a concrete example: the GUI might not care about CPU utilization
> percentages and not gather the numbers.  But when we run a MPI process
> mapper (a mapper is a one-off, at-this-instant layout scheduler) we don't
> want to wait while those numbers are requested and reported.  We want to
> support both continuously-running (GUI display tools) and start-exit
> (extract data, analyze, report) usage.

Sure.  But with a "permanent" direct connection, you just tell the
daemon when you want only one small (but predefined -- this isn't about
infinite user choice or a lack of design discipline) constellation of
outputs, and it stops polling the parts of /proc you don't care about.
If you suddenly need something -- a snapshot of running non-root
processes, a complete picture of meminfo, the clock and cache size and
architecture of the CPU (from different parts of /proc, in the latter
case something you will need only VERY infrequently and on user demand)
you just say e.g.

  on cpuinfo

and either

  send

to get an immediate reply via TCP or wait until the next unicast to get
the cpuinfo stuff added to the regular stream.

Yes, talking to the socket is "bad" as it interferes with synchronicity,
but then, you don't do it all the time and it is much cheaper than
having to actively (re)connect to the node to make a base configuration
change and restart everything. In the meantime you quietly accumulate
cycle savings by NOT parsing all the process IDs unless you really need
to, by NOT parsing /proc/cpuinfo or even /proc/stat unless the user
wants to look at it (well, truthfully xmlsysd reads cpuinfo just once at
the beginning and then just RETURNS it if requested anyway, so mostly
you save a bit of bw and packet size but people only look at this sort
of information for a few seconds anyway, no need to really poll it).
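
The throttling idea above can be sketched as a tiny command handler.
This is a toy model in Python, not xmlsysd's actual code: "on cpuinfo"
and "send" come from the example above, while the "off" command, the
StatDaemon class, the reader callables and the reply format are
assumptions made for illustration.

```python
class StatDaemon:
    """Toy model of a throttleable stats daemon (hypothetical design)."""

    def __init__(self, readers, read_once=("cpuinfo",)):
        self.readers = readers            # group name -> callable returning str
        self.enabled = set()              # groups currently switched on
        self.read_once = set(read_once)   # groups read once, then cached
        self.cache = {}
        self.reads = {name: 0 for name in readers}  # expensive reads performed

    def _read(self, name):
        # Static groups are served from the cache after the first read.
        if name in self.read_once and name in self.cache:
            return self.cache[name]
        self.reads[name] += 1
        value = self.readers[name]()
        if name in self.read_once:
            self.cache[name] = value
        return value

    def report(self):
        """Build one message containing only the enabled groups."""
        return "\n".join(f"{n}: {self._read(n)}" for n in sorted(self.enabled))

    def handle(self, line):
        """Process one control line; return the reply text (empty for on/off)."""
        words = line.split()
        if len(words) == 2 and words[0] in ("on", "off"):
            if words[1] not in self.readers:
                return "error unknown group " + words[1]
            if words[0] == "on":
                self.enabled.add(words[1])
            else:
                self.enabled.discard(words[1])
            return ""
        if words == ["send"]:
            return self.report()
        return "error bad command"
```

Groups that are switched off are never read at all, which is where the
cycle savings come from; the cached group costs one read no matter how
often it is sent.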

>> Well, both of them have to be sent by the network.  One can choose UDP
>> or TCP for either one, and each has advantages and disadvantages (with
>> -- gulp -- the same tradeoffs between reliability and robustness and
>> speed).
>
> There are trade-offs between UDP and TCP.  But UDP is the right model for
> stats.  I pretty much don't care about old numbers.  I'll go further -- I
> want to forget old numbers.  If there is a problem, communication or
> processing, that blocks updates for a minute, I want the numbers to
> look stale.  And I would rather get the new stats right away than process
> and post a series of old messages.

I'm not sure that this latter is an intrinsic difference/feature between
UDP and TCP; wulfstat doesn't display stale stats either.  That's really
a UI choice in what it does when EITHER message fails to get through.
With UDP you either catch the message or you don't, and with large
numbers of hosts replying in a DELIBERATELY small window, I'm guessing
you drop a lot of the messages and hosts blink in and out (unless you
cache them long enough to mask at least a round or two of missing info).
With TCP, dealing with per-host random delays without blocking and
detecting host crashes is most definitely a pain, but not impossible.  I
keep telling
myself, anyway...:-)
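
The "cache long enough to mask a round or two" policy is independent of
the transport and can be sketched on the client side.  The Python below
is only an illustration: the per-host sequence number is an assumption
(neither tool is claimed to send one), and the HostTable name, the
one-second period and the two-round grace are arbitrary choices.

```python
class HostTable:
    """Keep only the newest report per host and age it out for display."""

    def __init__(self, period=1.0, grace_rounds=2):
        self.period = period        # expected seconds between reports
        self.grace = grace_rounds   # missed rounds to mask before "down"
        self.latest = {}            # host -> (seq, received_at, stats)

    def update(self, host, seq, stats, received_at):
        """Store a report unless a newer one from this host is already held."""
        held = self.latest.get(host)
        if held is None or seq > held[0]:
            self.latest[host] = (seq, received_at, stats)

    def view(self, host, now):
        """Return (state, stats) where state is 'fresh', 'stale' or 'down'."""
        held = self.latest.get(host)
        if held is None:
            return ("down", None)
        _, received_at, stats = held
        age = now - received_at
        if age <= self.period:
            return ("fresh", stats)
        if age <= self.period * (1 + self.grace):
            # Report is late: keep showing the numbers, but flag them.
            return ("stale", stats)
        return ("down", None)
```

A late or reordered message is simply dropped rather than displayed, and
a host only blinks out after the grace period has passed.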

And as I said above, it seems as though one could have the best of both
-- it isn't really necessary to choose "only" one; both could even be
accessible simultaneously within the same running daemon.  You've
inspired me, in the best of open source traditions, and soon I will have
to Write More Code.  This will let me "fix" a number of things that have
annoyed me about xmlsysd (generally functional as it is).

Just as soon as I have time, since I have only six ongoing projects
plus two classes and two more independent study students, and dieharder
is taking most of my elective time.  Humans have their own scheduling
woes and I've been thrashing for a decade in spite of modest upgrades in
capacity...:-)

There you've got an advantage in addition to your natural good looks, I
guess, with people who will actually pay you to make changes and
improvements to your product.  I just do it partly out of love and
partly for my own use.

I'd rather have the money -- or perhaps would rather ALSO have the
money...;-)

>> Actually, from here on down -- with the exceptions of choosing to
>> use xml to encapsulate the return, TCP instead of UDP, and allowing the
>> client side to control and throttle the daemon so that one can tune the
>> impact of the monitoring to the demands of the cluster and task, the two
>> things sound very similar -- as they should be, given that they're both
>> N>3 generation tools.
>
> The bottom line for all of this isn't "mine is better than yours".  I
> would like to see a common cluster state/status/statistics reporting
> system.  It doesn't have to look exactly like BeoStat, but I expect a good
> one wouldn't be too far from the current BeoStat design.

The interesting thing is that our independently arrived-at designs are
remarkably SIMILAR -- much more like one another than either one is like
ganglia, for example.  I'm guessing that our proc parsing code is quite
similar on the back end; we both seem to report similar constellations
from proc without reporting EVERYTHING from proc or necessarily letting
a user muck around with what the tool can deliver.

Outside of the functional core, you chose one way to deliver messages
and configure (or not) the tool -- efficient but hard to change, debug,
or read as a human -- where I chose the other, relatively inefficient but
much easier to debug or read as a human, and controllable from a small palette
of choices.  Yours is tightly integrated, mine isn't really "integrated"
at all.

They are also "intended" to be used in different kinds of environments.
xmlsysd is a standalone object -- drop it onto any linux system and it
should just work, providing a connection-oriented relatively lightweight
remote client controllable window into the local /proc and systems
information space.  beostat sounds (correct me if I'm wrong) much more
like a fully integrated component of an all-or-nothing package.  You
wouldn't, maybe couldn't, install it on a plain old workstation and use
it as part of a straight sysadmin package to keep an eye on a LAN as
easily as a cluster, where sometimes I think wulfstat is MORE useful to
LAN admins than it is to a "real cluster" administrator, with their
stringent scaling requirements -- it certainly is designed for use on
small to midsize LAN-ish clusters with stock kernels more than for 2048
node superclusters.

What we really ought to do is exchange our data views (dictionary and
encapsulation), kick them around, and arrive at a not-too-horrible
consensus where at least one of our data views is an actual subset of
the other.  I "think" it would be pretty easy to add a command to
xmlsysd such as "on beostat" that caused it to simply do what it does now
but pack the result into beostat-compatible UDP packets.  If I DID leave
in the "out of band" TCP control channel -- something that the overall
scyld package probably accomplishes an entirely different way -- one
could perhaps get the best of both worlds -- something with the
operational leanness and scalability advantages of beostat but ALSO with
the ease of use and debuggability of xmlsysd.  EVEN on a Scyld type
cluster, there might be times when it is useful to be able to just
telnet into a node's xmlsysd port and tell one in human readable form
"just what do you think you are doing?", and on a LAN client that might
well be the dominant mode until 11 pm when you reboot (or not reboot,
merely "repurpose") the LAN client into being part of a
beostat-monitored-and-MPI-fronted cluster overnight.
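
To make "one data view, two encapsulations" concrete, here is a small
Python sketch.  Everything in it is hypothetical: the three-field
dictionary, the XML tag names and the 12-byte binary layout are invented
for the example and are NOT the real xmlsysd message or BeoStat packet
formats.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical shared dictionary: (field name, struct format character).
FIELDS = [("load1", "f"), ("mem_free_kb", "I"), ("uptime_s", "I")]

# Network byte order, fields packed back to back in dictionary order.
PACKED = struct.Struct("!" + "".join(fmt for _, fmt in FIELDS))


def to_xml(host, stats):
    """Human-readable encapsulation: one element per field."""
    root = ET.Element("host", name=host)
    for name, _ in FIELDS:
        ET.SubElement(root, name).text = str(stats[name])
    return ET.tostring(root, encoding="unicode")


def to_packet(stats):
    """Compact encapsulation: fixed-size payload suitable for one datagram."""
    return PACKED.pack(*(stats[name] for name, _ in FIELDS))


def from_packet(data):
    """Recover the same dictionary from the binary payload."""
    names = [name for name, _ in FIELDS]
    return dict(zip(names, PACKED.unpack(data)))


if __name__ == "__main__":
    sample = {"load1": 0.5, "mem_free_kb": 204800, "uptime_s": 86400}
    print(to_xml("node01", sample))
    print(len(to_packet(sample)), "bytes")
```

Both encoders walk the same field list, so agreeing on that list is the
hard part; switching the wire format is then a matter of choosing which
function to call.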

It would probably be simpler -- and more philosophically acceptable,
since xmlsysd is already a "gorpier" and more general purpose tool -- to
teach xmlsysd beostatish than to teach beostat to speak xmlish, but of
course you are welcome to copy, grab, etc xmlsysd's GPL code and make it
your own, or to otherwise steal the idea of a throttleable/remote
controllable command interface if you don't already have one.

The point is that if we COULD agree on data content and encapsulation --
or even offer a limited menu of choices of same, as xmlsysd already
endeavors to do -- then it would be very simple to make UI and
application tools that were interoperable and portable from LAN to
cluster, supported by a co-provided library and API.  Maybe even make it
easy to build a semi-portable load balancer, scheduler, job distribution
system etc, or to just build access to this block of information right
into applications.

Just a thought.

    rgb

-- 
Robert G. Brown                            Phone(cell): 1-919-280-8443
Duke University Physics Dept, Box 90305
Durham, N.C. 27708-0305
Web: http://www.phy.duke.edu/~rgb
Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977