[Beowulf] Performance metrics & reporting

Tue Apr 8 14:32:43 PDT 2008

On Tue, 8 Apr 2008, Jesse Becker wrote:
> Gerry Creager wrote:
> > Yeah, we're using Ganglia.  It's a good start, but not complete...
> 
> The next version of Ganglia (3.1.x) is being written to be much more easy to 
> customize, both on the backend metric collection by allowing custom modules 
> for gmond, and on the frontend with some changes to make custom reports easier 
> to write.  I've written a small pair of routines to monitor SGE jobs, for 
> example, and it could easily be extended to watch multiple queues.

It might be useful to consider what we did in the Scyld cluster system.

We found that a significant number of customers (and potential customers) 
were using Ganglia, or were planning on using it.  But those that were 
intensively using it complained about its resource usage.  In some cases 
it was using 20% of CPU time.

We have a design philosophy of running nothing on the compute nodes except 
for the application.  A pure philosophy doesn't always fit with a working 
system, so from the beginning we built in a system called BeoStat (Beowulf 
State, Status and Statistics).  To keep the "pure" appearance of our 
system we initially hid this in BeoBoot, so that it started immediately at 
boot time, underneath the rest of the system.

How are these two related?  To support Ganglia we just chopped out its 
underlying layers (which spend a huge amount of time generating and then 
parsing XML), and now generate the final XML directly from the BeoStat 
statistics already combined on the master.

This gave us the best of both worlds.  From BeoStat we get no additional 
load on compute nodes, lower network load, much higher efficiency, and 
easy scalability to thousands of nodes.  From Ganglia we get the ability 
to log and summarize historical data, good-looking displays, and the 
ability to monitor multiple clusters.

It might be useful to look at the design of BeoStat.  It's superficially 
similar to other systems out there, but we made decisions that are 
much different than others -- ones that most consider wrong until they 
understand their value.

Some of them are:
  It's not extensible
  It reports values in a binary structure
  It's UDP unicast to a single master machine
  It has no liveness criteria
  The receive side stores only current values 

The first one is the most unusual.  BeoStat is not extensible.  You can't 
add in your own stat entries.  You can't have it report stats from 64 
cores.  It reports what it reports... that's it.

Why is this important?  We want to deploy cluster systems.  Not build a 
one-off cluster.  We want the stats to be the same on every system we 
deploy.  We want every tool that uses the stats to be able to know that 
they will be available.  Once you allow and encourage a customizable 
system, every deployment will be different.  Tools won't work out of the 
box, and there is a good chance that tools will require mutually 
incompatible extensions.

Deploying a fixed-content stat system also enforces discipline.  We 
carefully considered what we need to report, and how to report it.  In 
contrast look at Ganglia's stats.  Why did they choose the set they did?  
Pretty clearly because the underlying kernel reported those values.  What 
do they mean?  The XML DTD doesn't tell you.  You have to look at the 
source code.  What do you use them for?  They don't know, they'll figure 
it out later.

People next ask "but what if I have 8/16/64 cores?  You only have 2 
[[ now 4 ]] CPU stat slots."  The answer is similar to the one above -- 
what are you going to do with all of that data?  You are going to 
summarize it before using it.  We just summarize it on the reporting 
side.  We report that 
there are N CPUs, the overall load average, and then summarize the CPU 
cores as groups (e.g. per socket).  For network adapters we report e.g. 
eth0, eth1, eth2 and "all the rest added together".
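
As a rough illustration of summarizing on the reporting side, here is a 
minimal Python sketch.  The function names, the use of "busy percent" as 
the CPU figure, and the dict of per-interface byte counters are all 
hypothetical; this is not BeoStat's actual code or layout.

```python
def summarize_cpus(core_busy, cores_per_group):
    """Average per-core busy percentages into one value per group.

    A "group" here stands in for a socket; the grouping rule is an
    assumption made for this sketch.
    """
    groups = [core_busy[i:i + cores_per_group]
              for i in range(0, len(core_busy), cores_per_group)]
    return [sum(g) / len(g) for g in groups]


def summarize_nics(rx_bytes, named=("eth0", "eth1", "eth2")):
    """Keep the named interfaces and fold every other one into 'rest'."""
    out = {name: rx_bytes.get(name, 0) for name in named}
    out["rest"] = sum(v for k, v in rx_bytes.items() if k not in named)
    return out
```

However many cores or adapters the node has, the output has a fixed 
number of entries, which is what lets the report keep a fixed shape.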

Once we chose a fixed set of stats, we had the ability to make it a 
fixed-size report.  It could be reported as binary values, with any 
per-kernel-version variation handled on the sending side.
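
To make the idea of a fixed-size binary report concrete, here is a small 
Python sketch using the standard struct module.  The field list and 
layout below are invented for illustration and are not the real BeoStat 
structure.

```python
import struct

# Hypothetical layout (NOT BeoStat's real one), in network byte order:
#   I   node id
#   H   number of CPUs
#   H   1-minute load average, multiplied by 100
#   BB  busy percentage for two CPU groups
#   2x  two padding bytes
#   4Q  received-byte counters for eth0, eth1, eth2 and "the rest"
REPORT = struct.Struct("!IHHBB2x4Q")


def pack_report(node_id, ncpus, load_x100, cpu_groups, net_rx):
    """Encode one report; the result is always REPORT.size bytes long."""
    return REPORT.pack(node_id, ncpus, load_x100, *cpu_groups, *net_rx)


def unpack_report(data):
    """Decode a report back into a dict of plain Python values."""
    node_id, ncpus, load_x100, g0, g1, *net_rx = REPORT.unpack(data)
    return {"node_id": node_id, "ncpus": ncpus, "load": load_x100 / 100,
            "cpu_groups": (g0, g1), "net_rx": tuple(net_rx)}
```

Every report from every node has the same length (REPORT.size), so a 
reader can find any field at a known offset without parsing.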

Having a small, limited-size report meant that it fit in a single network 
packet.  That makes the network load predictable and very scalable.  
It gave us the opportunity to effectively use UDP to report, without 
fragmenting into multiple frames.  UDP means that we can switch to and 
from multicast without changes, even changing in real time.
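
A sketch of the sending side in Python, to show why the size limit 
matters.  The MTU and header sizes assume standard Ethernet and an IPv4 
header without options; those figures, the function names and the 
address in the usage note are assumptions, not details from BeoStat.

```python
import socket

# Assumed link parameters: standard Ethernet MTU and an IPv4 header
# without options.  Neither figure comes from the original design.
ETHERNET_MTU = 1500
IPV4_HEADER = 20
UDP_HEADER = 8
MAX_PAYLOAD = ETHERNET_MTU - IPV4_HEADER - UDP_HEADER   # 1472 bytes


def fits_one_frame(payload):
    """True if the report can travel as one unfragmented frame."""
    return len(payload) <= MAX_PAYLOAD


def send_report(sock, payload, dest):
    """Send one report as a single datagram to dest, a (host, port) pair.

    dest may be the master's unicast address or a multicast group; the
    sendto() call is the same in both cases.
    """
    if not fits_one_frame(payload):
        raise ValueError("report would fragment: %d bytes" % len(payload))
    sock.sendto(payload, dest)
```

Usage would look something like 
send_report(socket.socket(socket.AF_INET, socket.SOCK_DGRAM), report, 
("master", 5545)), where the host name and port are placeholders.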

A fixed-size frame makes the receiving side simple as well.  We just 
receive the incoming network frame into memory.  No parsing, no 
translation, no interpretation.  The receiving process does only trivial 
work.  This is important when the receiver is the master, which could end 
up with the heaviest workload if the system isn't carefully designed.  
We've supported 1000+ machines for years, and are now designing around 
10K nodes.

We do a tiny bit more when storing a stat packet -- we add a timestamp.  
We can use this to figure out the time skew between the master and 
compute node, to verify the network reliability/load, and to decide if 
the node is live.
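
The receive path described above might be sketched in Python like this.  
The report size, the port number and the choice to key the table by 
sender address are assumptions for the example; the real receiver is not 
shown in this post.

```python
import socket
import time

REPORT_SIZE = 44          # hypothetical fixed report size, in bytes
table = {}                # sender address -> (receive time, raw bytes)


def store_report(table, sender, raw, now):
    """Record one report untouched, tagged with the master's clock."""
    if len(raw) != REPORT_SIZE:
        return False      # wrong size: drop it rather than try to parse it
    table[sender] = (now, bytes(raw))
    return True


def serve(port=5545):     # placeholder port number
    """Receive loop: one recvfrom_into and one dict store per report."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    # One spare byte so that a too-long datagram (truncated to the buffer
    # on Linux) shows up as a size mismatch and is dropped.
    buf = bytearray(REPORT_SIZE + 1)
    while True:
        nbytes, (sender, _port) = sock.recvfrom_into(buf)
        store_report(table, sender, buf[:nbytes], time.time())
```

Nothing in the loop looks inside the report.  The only work per packet 
is a length check, a copy and a timestamp.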

This isn't the only liveness test.  It's not even the primary liveness 
test.  We document it as only a guideline.  Developers should use the
underlying cluster management system to decide if a node has died.  But if 
there hasn't been a recent report, a scheduler should avoid using the node.  
Classifying the world into Live and Dead is wrong.  It's at least Live, 
Dead and Schrodinger's Still-boxed Cat.
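
One way to express that three-way split, as a Python sketch.  The state 
names, the max_age threshold and the manager_says_dead flag (standing in 
for whatever the cluster management system reports) are all hypothetical.

```python
def node_state(last_report, now, manager_says_dead, max_age=5.0):
    """Classify a node as 'live', 'dead' or 'unknown'.

    Only the cluster management layer can declare a node dead.  The
    stats timestamp can confirm that a node is live, or say that we
    currently do not know.
    """
    if manager_says_dead:
        return "dead"
    if last_report is not None and now - last_report <= max_age:
        return "live"
    return "unknown"


def schedulable(last_report, now, manager_says_dead, max_age=5.0):
    """A scheduler should place work only on nodes known to be live."""
    return node_state(last_report, now, manager_says_dead, max_age) == "live"
```

The point of the third state is that a missing report is a reason to 
avoid a node, not proof that it has died.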

Finally, this is a State, Status and Statistics system.  It's a 
scoreboard, not a history book.  We keep only two values, the last two 
received.  That gives us the current info, and the ability to calculate 
rate.  If any subsystem needs older values (very few do) it can pick a 
logging, summarization and coalescing approach of its own.
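
Keeping the last two samples is enough to compute a rate, as this small 
Python sketch shows.  The class and method names are made up for the 
example and the sample is reduced to a single counter.

```python
class Slot:
    """Holds the two most recent (timestamp, counter) samples."""

    def __init__(self):
        self.prev = None
        self.curr = None

    def update(self, timestamp, counter):
        """Shift the current sample to previous and store the new one."""
        self.prev, self.curr = self.curr, (timestamp, counter)

    def rate(self):
        """Counter units per second, or None until two samples exist."""
        if self.prev is None or self.curr is None:
            return None
        dt = self.curr[0] - self.prev[0]
        if dt <= 0:
            return None       # duplicate or out-of-order timestamps
        return (self.curr[1] - self.prev[1]) / dt
```

Memory use per node stays constant no matter how long the cluster runs, 
which is what you want from a scoreboard.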

We made many other innovative architectural decisions when designing the 
system, such as publishing the stats as a read-only shared memory version.  
But these are less interesting because no one disagrees with them ;-).

-- 
Donald Becker				becker at scyld.com
Penguin Computing / Scyld Software
www.penguincomputing.com		www.scyld.com
Annapolis MD and San Francisco CA