[Beowulf] What services do you run on your cluster nodes?
becker at scyld.com
Sun Sep 28 11:07:02 PDT 2008
On Fri, 26 Sep 2008, Robert G. Brown wrote:
> On Fri, 26 Sep 2008, Donald Becker wrote:
> > But that rule doesn't continue when we move to higher core counts. We
> > still want a little observability, but a number for each of a zillion
> > cores is useless. Perhaps worse than useless, because each tool has to
> > make its own decision about how to summarize the values before using them.
> > A better solution is to have the reporting side summarize the values.
> Why is this a better solution? Might not applications NOT wish to
> summarize or aggregate? And why does the cutoff occur at 2 cpus (and
> not 1).
There are four fixed slots for CPU utilization percentage, not two.
One rule for counting is 0, 1, 2, Many. You have to draw the cut-off
somewhere, and somewhere around 4 or 8 is where the numbers stop being
useful when a human looks at them.
After 4 you find that you stop caring about what each core is doing, and
instead care about:
- how many cores are essentially idle / available to do work
- how close to fully busy are the occupied cores
- how close to completely idle are the idle cores
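
A minimal sketch of that kind of reporting-side summary, in Python. The
function name, the 10% idle cutoff, and the returned field names are all
assumptions made for illustration; none of them come from BeoStat.

```python
def summarize_cores(utilization, idle_threshold=10.0):
    """Collapse per-core utilization percentages into three summary numbers.

    idle_threshold is an assumed cutoff: cores at or below it count as idle.
    """
    idle = [u for u in utilization if u <= idle_threshold]
    busy = [u for u in utilization if u > idle_threshold]
    return {
        # How many cores are essentially idle / available to do work.
        "idle_cores": len(idle),
        # Mean utilization of the occupied cores (how close to fully busy).
        "busy_mean": sum(busy) / len(busy) if busy else 0.0,
        # Mean utilization of the idle cores (how close to completely idle).
        "idle_mean": sum(idle) / len(idle) if idle else 0.0,
    }


if __name__ == "__main__":
    # Eight cores: four working hard, four nearly idle.
    print(summarize_cores([98.0, 96.0, 2.0, 0.0, 100.0, 4.0, 94.0, 0.0]))
```

The point of the sketch is only that three numbers answer the questions
above regardless of how many cores the node has.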
> And what do you choose to compute and return? Aggregated
> activity (not showing how it is distributed), or average activity (even
> worse, just showing a nominal percentage of total aggregate activity)?
For the numbers reported per socket or per core, you report and use
utilization percentage. BeoStat also reports system load average, mostly
because people expect it. But the length of the run queue isn't a good
indication of how effectively the node is getting work done.
> And how do you differentiate (or do you) between a single processor dual
> core and a dual processor single core and a single processor quad core
> and a dual processor dual core, etc?
The CPU core/socket enumeration naturally groups the cores within a single
socket. That might change next year when we have three cores per socket.
At that point we should do some redesign -- and redesign here means
predicting the future. A good approach is to group cores by which
channels to memory they use, and start reporting memory controller
utilization and contention. My prediction is that those memory controller
stats will be the best indication of still-available node capacity. CPU
utilization percentages will move from being the primary stat, to a
secondary stat -- the CPU/memory busy ratio used for reporting how
effectively the busy cores are being used.
> A network bottleneck on a system with multiple network interfaces shows
> up not necessarily as the aggregate being saturated, but as a particular
> interface being saturated. There may be multiple interfaces, and they
We support four network reporting slots: 0, 1, 2, and "all of the rest".
> may not even have the same speed characteristics -- "saturation" on one
> may be a small fraction of the capacity of another.
Finding the speed of a network is problematic. Even if we limit ourselves
to Ethernet frame format, there are several types of networks which fake a
speed report, and dynamically change speed. Non-Ethernet-like
networks are even more difficult, especially when they mix RDMA traffic
with packet traffic.
[[ OK, I'll admit this as a short-coming of BeoStat. When we designed it,
I knew we couldn't get accurate network speed numbers. Since I wrote most
of the kernel drivers, I knew all of the shortcomings, corner cases and
caveats. So we didn't even attempt to report a number, even statically.
Someone that knew less would make a sleazy assumption e.g. "100Mbps-HD,
100Mbps-FD or 1Gbps-FD" that would be right most of the time. ]]
[[ A secondary problem is that BeoStat doesn't re-order and identify the
networks. It just reports them as they are listed in /proc/net/dev. It
would be better to identify the networks as being used for booting,
control, message communication and file I/O. And then order them so
that unused NICs aren't reported in the first three stat slots, leaving
important networks combined in the final, summary slot. ]]
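
The fixed-slot layout described above (the first three interfaces reported
individually, everything else combined) can be sketched roughly as follows.
The tuple layout, the function name, and the "rest" label are hypothetical,
chosen for illustration rather than taken from BeoStat.

```python
def fold_interfaces(counters, named_slots=3):
    """Fold a list of per-interface counters into fixed reporting slots.

    counters: list of (name, rx_bytes, tx_bytes) tuples, in the order the
    interfaces were listed. The first `named_slots` entries are kept as-is;
    all remaining interfaces are summed into one final "rest" slot.
    """
    slots = list(counters[:named_slots])
    remainder = counters[named_slots:]
    rest_rx = sum(rx for _, rx, _ in remainder)
    rest_tx = sum(tx for _, _, tx in remainder)
    # The summary slot is always present so the layout stays fixed-size
    # whenever at least `named_slots` interfaces exist.
    slots.append(("rest", rest_rx, rest_tx))
    return slots


if __name__ == "__main__":
    listed = [
        ("eth0", 10, 1),
        ("eth1", 20, 2),
        ("eth2", 30, 3),
        ("eth3", 40, 4),
        ("ib0", 50, 5),
    ]
    for slot in fold_interfaces(listed):
        print(slot)
```

Re-ordering the input list by role (boot, control, messaging, file I/O)
before folding would keep the important networks out of the combined slot.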
> counts, or just the rates? In other words, who does the dividing to
> turn packet count deltas into a rate?
I think we have implemented a simple, general-purpose solution with two
reporting slots, and good-granularity timestamps. That allows programs to
compute the rate without keeping their own state.
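
As a rough illustration of why two timestamped slots are enough, here is a
sketch in Python. The `Sample` type and its field names are assumptions
made for the example, not the BeoStat data layout, and counter wrap-around
is deliberately not handled.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    timestamp: float  # seconds, taken on the reporting side
    rx_bytes: int     # cumulative byte counter at that instant


def rate(older: Sample, newer: Sample) -> float:
    """Bytes per second between two reported samples.

    Both samples come from the reporting side, so the reader needs no
    saved state of its own to turn counters into a rate.
    """
    dt = newer.timestamp - older.timestamp
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")
    return (newer.rx_bytes - older.rx_bytes) / dt


if __name__ == "__main__":
    previous = Sample(timestamp=100.0, rx_bytes=1_000)
    current = Sample(timestamp=102.0, rx_bytes=251_000)
    print(rate(previous, current))
```

A monitoring program can read both slots in one pass and compute the rate
immediately, even on its first look at the node.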
Note that this doesn't attempt to fill the same role as RRDtool
(Round-Robin Database Tool). That system keeps a long record of
historical values, and makes decisions about how to summarize and collapse
[[ A BeoStat redesign would include a feature to make it easier to
keep historical stats. If we included a ring buffer
logging which node slots had updated values, we could have a
daemon that knew which stats had been updated instead of having to
scan the table slightly more frequently than the one-second update
> Incidentally, avoiding client-side arithmetic minimizes computational
> impact on the nodes, sometimes at the expense of a larger return packet.
The arithmetic is trivial. We are talking about some additions, perhaps
averaging two numbers. There isn't anything time consuming. The biggest
cost is probably the floating point register context switch -- with lazy
FP register set switching, the first time you touch any FP register you
pay a big cost. If you do even one FP operation, even an implicit
conversion that doesn't look like real work, you might as well do a bunch
of FP work.
Donald Becker becker at scyld.com
Penguin Computing / Scyld Software
Annapolis MD and San Francisco CA