[Beowulf] What services do you run on your cluster nodes?

Fri Sep 26 17:19:49 PDT 2008

On Fri, 26 Sep 2008, Donald Becker wrote:

> But that rule doesn't continue when we move to higher core counts.  We
> still want a little observability, but a number for each of a zillion
> cores is useless.  Perhaps worse than useless, because each tool has to
> make its own decision about how to summarize the values before using them.
> A better solution is to have the reporting side summarize the values.

Why is this a better solution?  Might not applications NOT wish to
summarize or aggregate?  And why does the cutoff occur at 2 cpus (and
not 1).  And what do you choose to compute and return?  Aggregated
activity (not showing how it is distributed), or average activity (even
worse, just showing a nominal percentage of total aggregate activity?
And how do you differentiate (or do you) between a single processor dual
core and a dual processor single core and a single processor quad core
and a dual processor dual core, etc?

I'd say that ONE solution is to provide a tool that does client side
averages and aggregation, and I initially did it that way, in part to
minimize the size of the return and make life "easy", or so I thought.
I changed my mind, because I discovered that I DID want a lot of the
information detail that was being aggregated and hidden.  It actually
was very useful.  IMO at this point, it is "better" to provide the raw
numbers per core (and network device) to the client (monitor program,
head node, whatever) side and let IT decide what it wants to display or
how it wants to use the numbers.  Here is my reasoning.

A network bottleneck on a system with multiple network interfaces shows
up not necessarily as the aggregate being saturated, but as a particular
interface being saturated.  There may be multiple interfaces, and they
may not even have the same speed characteristics -- "saturation" on one
may be a small fraction of the capacity of another.  A CPU bottleneck --
especially in a cluster that isn't doing "the same thing" synchronously
on all the cores by hypothesis -- can be one processor core pegged at
100% but all the rest near idle (possibly waiting on the one that is
pegged!).  Even on an 8 core system -- again where it might be an 8 core
server in a LAN with cores allocated among several completely distinct
VMs (some of them running Windows and invisible to "direct" monitoring)
or an 8 core node in a cluster -- if the cores aren't running a set of
homogeneous tasks aggregates will not reveal a bottleneck or the
reaching of a resource limit at a glance.  mysql might be max'd out on a
single core and several mysql clients might be tamely waiting on it,
even though the AGGREGATE CPU is down there at < 20%, and with people
wondering why everything is slow.

When you make a toplevel design decision like this -- to aggregate and
average, or not -- you are basically picking a set of assumptions
concerning what people will need.  However, people's needs vary and
almost any assumption that you make will NOT fit the needs of some
subset.  That may be OK -- you make a tool that is right for a
restricted subspace of things that DO match your assumptions, and that's
fine.  In the case of scyld, you tightly define the cluster
architecture, so you MAKE it so the assumptions work on both ends.  But
the price is that it makes your tool all but useless to people with
problems your assumptions hide but that would be revealed and solvable
if you didn't hide them, or environments that don't match your
assumptions ditto.

I ran through similar considerations on the network side.  One
interface?  All "ethernet" interfaces?  Or just everything the kernel
tags as "an interface"?  Per interface, do I send back the raw packet
counts, or just the rates?  In other words, who does the dividing to
turn packet count deltas into a rate?

The interface decision was easy -- I started with just one, had a few
systems with two where I needed numbers on both, and finally just broke
down and send the information on all interfaces, and let the client
decide which ones are "important".  The rate issue was a tougher
decision, because xmlsysd doesn't operate on a predetermined schedule.
To get an accurate differential rate, one really should sample on a
fine-grained time granularity -- average over a time order 1 sec, say.
Again, one has the possibility of totally saturating the network for
short bursts but having a relatively low AVERAGE rate.  I tried various
schema for doing node-side averaging and none of them were very
satisfactory.

Ultimately -- and this is a decision I could easily reconsider or make a
client-controlled option, given a view of the TCP interface as being
there primarily as a control interface and using UDP to actually
back-transport the information -- I opted to just send the raw numbers
again -- it is consistent and the client side can decide if 1, 5, 500
second averages are OK (and indeed can decide to sample twice 0.1
seconds apart to generate local deltas and then NOT sample for 10
seconds, so the numbers given are an instant snapshot of the rate, but
that rate is only updated occasionally). Not necessarily the "ultimate"
solution -- I still think of implementing a fixed-width window and doing
a timed delta of perhaps 0.1 or 0.01 sec across it to make various rates
-- but that requires two interruptions of the background work in place
of one, and introduces a longish delay between polling the node and
getting the answer, neither desireable on a TCP connection or for all
possible sets of work the node might be doing.

Incidentally, avoiding client-side arithmetic minimizes computational
impact on the nodes, sometimes the expense of a larger return packet.
This in turn could be mitigated by means of adding more controls to
permit the remote and dynamic reconfiguration of the daemon, although I
haven't yet done this as broadly as I could.  I keep it simple
unless/until there is a real need to add complexity, lest I get to where
the tool itself becomes as complex as /proc itself.

FWIW, the kernel does exactly this for (I'm sure) similar reasons:
provide the raw, instantaneous numbers and let userspace clients do
whatever selection, aggregation, and rate arithmetic they wish with
them.  It's not that difficult.  All the procps tools do it (starting
with a much uglier data interface to parse).  It's not the only
solution.  It MAY not be the "best" solution.  But it's not a bad one.

> Again the problem with XML and other extensible systems is that people use
> the flexibility to avoid thoughtful design.  Sure, it's obvious how to just
> add a few more records when we go from two to four core per socket.  And
> reporting system won't obviously break when we have 64 cores in each of 4
> sockets.  But it really just shifts the re-design work to the tools and
> applications.

I absolute agree with the first and have been saying it myself, although
I wouldn't say it is >>THE<< problem with XML, just >>A<< problem.
There are obviously data structures for which XML isn't a good solution
period, no matter how thoughtful you are, and IMO its BIGGEST problem is
that one cannot easily switch between its default human readable but
extremely "fat" encapsulation and a heavily compressed binary
encapsulation, by means of toggling a single switch in the library.
Toggle fat to debug, toggle thin, become efficient.  Also, at the very
least I'd qualify the "people" into "some people".  But that's a
UNIVERSAL problem, whether or not you use a library or data
encapsulation standard.  There's no substitute for thoughtful design,
but equally true there's nothing obvious or easy about it either.  It's
why good programmers are (often) well paid.

And truthfully I think it is clearly a BENEFIT (all things being equal)
to have a reporting system that won't break (obviously or not) when
going from 1 to N cores.  Note well that whether or not this breaks the
tools and applications depends on whether or not they were designed to
scale across a variable number of CPUs from the beginning.  The fact
that design decisions on a client-server application pair can be
obsoleted by changes beyond one's control on the server side is hardly a
general conclusion one can use to indict extensible encapsulations of
the data.  One (fairly obviously) needs to be aware of and code client
side applications to be able to manage variability where it exists, if
you can TELL where it exists.  The problem comes when something that
wasn't variable becomes variable, or where something that was in a
comfortable and familiar range suddenly isn't.  I'll resist making up a
car metaphor to illustrate the point, though...;-)

When you have a "sudden" change in the kernel -- some resource that only
existed on computers one at a time from time immemorial -- to a resource
that now comes N at a time, things are going to break.  Think of the
2.0.0 linux kernel -- suddenly there were two processors!  Tools that
KNEW there were never going to be more than one didn't know what to do!

In general, everything was assumed 1, now it is N.  Everything will have
to be fixed (or ignore the change and report only one, or as if it were
only one).  You are absolutely correct that this change requires a
client side change to manage, but that change needs to occur ONCE to
make the CLIENT aware of the N whatevers and arrange for IT to be able
to transparently manage 1-N of them as they appear in the return.  If
you do it badly, you say "well, now there are EITHER one or at most TWO
processors per system.  But I'll never need more than two, so all my
loops and display and processing decisions can use two as a hard upper
bound."

The price you pay is that the day dual duals or single quads, or
eight-ways appear, you can rewrite everything AGAIN -- expensive,
especially if you DIDN'T use an extensible data encapsulation scheme
(canned or homebrew) -- or tell yourself "well, I'll just hide the eight
and average them somehow back to two so my API remains fixed and my
client(s) do(es)n't break".  Too bad if someone wants to know per core
utilization on any system with lots of cores, not just aggregate
utilization, because the aggregate number you fudge will not reveal the
core saturation which could be just what you need to see on servers or
nodes that can be rate limited by a saturated processor handling a
single threaded task.

It isn't always easy to know what level of detail is going to be useful
to at least some users (including yourself, actually), so when one
builds a tool like this one has a tendency to support what YOU need, for
YOUR purposes -- right now.  Putting in "everything" is clearly
expensive but gives you everything -- lets you run a true "cluster-top"
or "cluster-ps" that requires access to "all" the contents of /proc on
all the nodes to function.  Putting in just load averages is cheap but
omits lots of detail that many people will need.  There is no unique and
obviously best solution to this problem, only tradeoffs.  The closest
one can come to best is "one that satisfies most of the people, most of
the time" where satisfaction has to take into account having all the
data they might need at a "cost" in resource utilization they are
willing to pay."  And it is again obvious that that best solution
usually going to be a somewhat VARIABLE solution -- one that provides at
least a bit of choice to the user to be able to select the best tradeoff
for their personal needs out of the available (limited) choices -- and
one that is extensible in dimensions (like core count or interface
count) where extension is likely to occur.

    rgb

-- 
Robert G. Brown                            Phone(cell): 1-919-280-8443
Duke University Physics Dept, Box 90305
Durham, N.C. 27708-0305
Web: http://www.phy.duke.edu/~rgb
Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977