[Beowulf] What services do you run on your cluster nodes?

Sun Sep 28 11:07:02 PDT 2008

On Fri, 26 Sep 2008, Robert G. Brown wrote:

> On Fri, 26 Sep 2008, Donald Becker wrote:
> 
> > But that rule doesn't continue when we move to higher core counts.  We
> > still want a little observability, but a number for each of a zillion
> > cores is useless.  Perhaps worse than useless, because each tool has to
> > make its own decision about how to summarize the values before using them.
> > A better solution is to have the reporting side summarize the values.
> 
> Why is this a better solution?  Might not applications NOT wish to
> summarize or aggregate?  And why does the cutoff occur at 2 cpus (and
> not 1).

There are four fixed slots for CPU utilization percentages, not two.

One rule for counting is 0, 1, 2, Many.  You have to draw the cut-off 
somewhere, and somewhere around 4 or 8 is where the numbers stop being 
useful when a human looks at them.

After 4 you find that you stop caring about what each core is doing, and
instead ask
  - how many cores are essentially idle / available to do work
  - how close to fully busy are the occupied cores
  - how close to completely idle are the idle cores
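
A minimal sketch of a summary that answers those three questions, assuming
the per-core utilization percentages are already in hand.  The function
name and the 10% idle threshold are hypothetical, not BeoStat's.

```python
def summarize_cores(util_percent, idle_below=10.0):
    """Collapse per-core utilization percentages into three summary numbers.

    idle_below is an assumed cut-off (in percent) under which a core counts
    as essentially idle; it is an illustration value, not a BeoStat setting.
    """
    idle = [u for u in util_percent if u < idle_below]
    busy = [u for u in util_percent if u >= idle_below]
    return {
        # How many cores are essentially idle / available to do work.
        "idle_cores": len(idle),
        # Mean utilization of the occupied cores: how close to fully busy.
        "busy_mean": sum(busy) / len(busy) if busy else 0.0,
        # Mean utilization of the idle cores: how close to completely idle.
        "idle_mean": sum(idle) / len(idle) if idle else 0.0,
    }
```

Three numbers replace one number per core, so the size of the report no
longer grows with the core count.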

>  And what do you choose to compute and return?  Aggregated
> activity (not showing how it is distributed), or average activity (even
> worse, just showing a nominal percentage of total aggregate activity?

For the numbers reported per socket or per core, you report and use
utilization percentage.  BeoStat also reports system load average, mostly
because people expect it.  But the length of the run queue isn't a good
indication of how effective the node is getting work done.

> And how do you differentiate (or do you) between a single processor dual
> core and a dual processor single core and a single processor quad core
> and a dual processor dual core, etc?

The CPU core/socket enumeration naturally groups the cores within a single 
socket.  That might change next year when we have three cores per socket.

At that point we should do some redesign -- and redesign here means
predicting the future.  A good approach is to group cores by which
channels to memory they use, and start reporting memory controller
utilization and contention.  My prediction is that those memory controller
stats will be the best indication of still-available node capacity. CPU
utilization percentages will move from being the primary stat, to a 
secondary stat -- the CPU/memory busy ratio used for reporting how 
effectively the busy cores are being used.

> A network bottleneck on a system with multiple network interfaces shows
> up not necessarily as the aggregate being saturated, but as a particular
> interface being saturated.  There may be multiple interfaces, and they

We support four network reporting slots: 0, 1, 2, and "all of the rest".

> may not even have the same speed characteristics -- "saturation" on one
> may be a small fraction of the capacity of another.

Finding the speed of a network is problematic.  Even if we limit ourselves 
to Ethernet frame format, there are several types of networks which fake a 
speed report, and dynamically change speed.  Non-Ethernet-like 
networks are even more difficult, especially when they mix RDMA traffic 
with packet traffic.

[[ OK, I'll admit this as a short-coming of BeoStat.  When we designed it,
I knew we couldn't get accurate network speed numbers.  Since I wrote most
of the kernel drivers, I knew all of the shortcomings, corner cases and
caveats.  So we didn't even attempt to report a number, even statically.  
Someone who knew less would make a sleazy assumption, e.g. "100Mbps-HD,
100Mbps-FD or 1Gbps-FD", that would be right most of the time. ]]

[[ A secondary problem is that BeoStat doesn't re-order and identify the 
networks.  It just reports them as they are listed in /proc/net/dev.  It 
would be better to identify the networks as being used for booting, 
control, message communication and file I/O.  And then order them so 
that unused NICs aren't reported in the first three stat slots, leaving 
important networks combined in the final, summary slot. ]]
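
A rough sketch of the slot layout described above: the first three
interfaces get their own slot and everything else is summed into a final
"rest" slot.  It parses text in the /proc/net/dev format.  The function
name and the `preferred` parameter (a list of interface names to move to
the front, standing in for the role-based ordering) are hypothetical and
not part of BeoStat.

```python
def network_slots(proc_net_dev_text, named_slots=3, preferred=()):
    """Return [(name, rx_bytes, tx_bytes)], at most named_slots + 1 entries.

    preferred is a hypothetical ordering hint: interfaces named in it are
    placed first, the others keep their /proc/net/dev order.
    """
    stats = []
    for line in proc_net_dev_text.splitlines()[2:]:  # skip two header lines
        if ":" not in line:
            continue
        name, fields = line.split(":", 1)
        values = fields.split()
        # Column 0 is received bytes, column 8 is transmitted bytes.
        stats.append((name.strip(), int(values[0]), int(values[8])))

    preferred = list(preferred)
    # Stable sort: preferred names first, the rest in listed order.
    stats.sort(key=lambda s: preferred.index(s[0])
               if s[0] in preferred else len(preferred))

    slots = stats[:named_slots]
    rest = stats[named_slots:]
    if rest:
        slots.append(("rest",
                      sum(s[1] for s in rest),
                      sum(s[2] for s in rest)))
    return slots
```

With an empty `preferred` this reproduces the "as listed" behaviour; a
list such as `("eth2",)` shows how an important network could be kept out
of the summary slot.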

> counts, or just the rates?  In other words, who does the dividing to
> turn packet count deltas into a rate?

I think we have implemented a simple, general-purpose solution with two 
reporting slots, and good-granularity timestamps.  That allows programs to 
compute the rate without keeping their own state.
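
As an illustration, a reader could turn the two slots into a rate along
these lines.  The (timestamp, counter) layout, the function name and the
32-bit counter width are assumptions made for the sketch, not the BeoStat
data format.

```python
def rate_per_second(older, newer, counter_bits=32):
    """Compute a per-second rate from two (timestamp_seconds, counter) slots.

    counter_bits is an assumed counter width; the modulo lets the result
    survive a single counter wrap between the two samples.
    """
    t0, c0 = older
    t1, c1 = newer
    dt = t1 - t0
    if dt <= 0:
        return None  # the two slots do not yet hold distinct samples
    delta = (c1 - c0) % (1 << counter_bits)
    return delta / dt
```

Because both samples and their timestamps come from the reporting side,
the caller needs no saved state of its own: one read of the two slots is
enough to get a rate.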

Note that this doesn't attempt to fill the same role as RRDtool 
(Round Robin Database tool).  That system keeps a long record of 
historical values, and makes decisions about how to summarize and collapse 
the log.

[[ A BeoStat redesign would include a feature to make it easier to
keep historical stats.  If we included a ring buffer 
logging which node slots had updated values, we could have a
daemon that knew which stats had been updated instead of having to 
scan the table slightly more frequently than the one-second update 
period. ]]
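
One way such an update log might look, as a single-process sketch only.
The class and method names are hypothetical, and a real version would live
in shared memory with whatever synchronization the writer needs.

```python
class UpdateRing:
    """Fixed-size log of which node slots were updated, oldest overwritten."""

    def __init__(self, size):
        self.size = size
        self.entries = [None] * size
        self.head = 0  # total number of updates ever logged

    def log(self, node):
        """Writer side: record that this node's stats slot was refreshed."""
        self.entries[self.head % self.size] = node
        self.head += 1

    def since(self, cursor):
        """Reader side: return (updated nodes, new cursor, overflowed).

        overflowed is True when the reader fell more than one ring behind,
        in which case entries were lost and a full table scan is needed.
        """
        overflowed = self.head - cursor > self.size
        start = max(cursor, self.head - self.size)
        nodes = [self.entries[i % self.size] for i in range(start, self.head)]
        return nodes, self.head, overflowed
```

A history daemon would keep only its cursor, call `since()` periodically,
and read just the listed node slots unless `overflowed` is set.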

> Incidentally, avoiding client-side arithmetic minimizes computational
> impact on the nodes, sometimes the expense of a larger return packet.

The arithmetic is trivial.  We are talking about some additions, perhaps
averaging two numbers.  There isn't anything time consuming.  The biggest
cost is probably the floating point register context switch -- with lazy
FP register set switching, the first time you touch any FP register you
pay a big cost.  If you do even one FP operation, even an implicit
conversion that doesn't look like real work, you might as well do a bunch
of FP work.

-- 
Donald Becker				becker at scyld.com
Penguin Computing / Scyld Software
www.penguincomputing.com		www.scyld.com
Annapolis MD and San Francisco CA