[Beowulf] What services do you run on your cluster nodes?
becker at scyld.com
Tue Sep 23 09:27:09 PDT 2008
On Tue, 23 Sep 2008, Robert G. Brown wrote:
> On Mon, 22 Sep 2008, Matt Lawrence wrote:
> > On Mon, 22 Sep 2008, Bernard Li wrote:
> >> Ganglia collects metrics from hosts and trends them for the user.
> >> Most of these metrics need to be collected from the host itself (CPU,
> >> memory, load, etc.).
> >> Besides, the footprint of Ganglia is very little. I have yet heard of
> >> a user complaining that Ganglia uses too much resources. Of course,
> >> YMMV if you need every last CPU/memory for your job, then you should
> >> turn everything off at the cost of managing a blackbox.
> > Well, the folks I talked to at TACC were not enthusiastic about the amount of
> > resources ganglia uses. I will agree that there is a lot of unecessary stuff
> > that goes on, like converting everything to and from XML for each message.
> XML is (IMO) good, not bad.
I have so much to write on this topic, I'll take the first pot shot at RGB
XML is evil. Well, evil for this.
Ganglia does consume a significant portion of resources. I've heard
first-hand reports of 20% CPU. (Admittedly before they figured out what
was happening and turned the reporting list and frequency way down.)
When we found that we needed to support Ganglia, I grumbled mightily and
set out to figure out why it was so bad. Deep in its black heart I found
the source of evil: XML. And not just the detailed evil of the syntax,
the evil power had seeped into the higher level architecture with
The only way to remove the evil from Ganglia was to rip out its heart. I
wrote a simple translator that read the statistics from our BeoStat
subsystem and generated the final Ganglia XML. This eliminated the
heavy-weight reporting daemons on compute nodes and multiple rounds of
generating and parsing XML, while keeping Ganglia's familiar RRD-based
Those familiar with BeoStat might point out that it does not have the same
functionality. It doesn't report everything that Ganglia can. That's
exactly the point of BeoStat -- it reports only the essential state,
status and statistics for nodes. And it reports the same items, every time
and in every installation. It's much more easily used by tools because
they can count on which values will be reported, and the update frequency.
Static values are gathered once a "node_up" time, and changing values
are updated once per second while the node is up.
BeoStat is used by multiple schedulers, mappers, stat displays (e.g.
Ganglia) and GUIs. It's an essential part of the Scyld architecture and
reflects it design philosophy -- minimizing operational overhead on the
compute nodes, while simplifying and unifying the cluster. We use a
single reporting daemon per compute node, and the results are gathered
together and published at one place -- master nodes.
As usual, the devil can lurk in the details. So we wrote a system,
drowned it in holy water, and re-wrote it. Several times, until it rinsed
The results are reported as a UDP packet that fits in a single Ethernet
frame. This eliminates the issues with TCP, and allows optionally
switching to multicast[*2], although the default is to unicast to the
master node[*3]. They are small because the parsing, scale
correction and structure packing is done on the compute node.
The packets are sent once per second, each packet has a 'timespec'
timestamp, and each is further timestamped on reception. This handles
reporting aperiodicity, wall-clock skew, and allows liveness
determination. (But don't assume a node is dead if its BeoStat timestamp
is old. We use an separate TCP-based node management system for that.)
The receiving side is similarly carefully designed. In order to scale to
thousands of nodes without falling behind it has a streamlined receive
path. It reads the packet, verifies the header, and copies the packed
stats directly to a public readable shared memory region. Each node has
two stats slots. This allows calculating rate and rate-of-change info,
while the single writer and receive-side timestamp eliminates locks. By
putting the results in a world-readable shared memory region any tool can
efficiently map in and use the stats, either through the library or
Again with the details: the whole core system is written in C, calling
only basic library functions in a single-threaded process. There isn't
any interpreter layer, function calls that take an arbitrary amount of
time, or garbage collection. Those features usually don't cause problems,
but this is vital functionality and "usually don't" isn't as good as
Hmmm, that ended up a longer description than I intended, for just one
narrow aspect of our system. There is much more to designing an efficient
cluster system -- but those details will have to have to wait for the next
[*1] Yes, you read that correctly. The 'do anything, in multiple ways'
nature of XML encourages projects to not make the hard design decisions up
front. Just like on SourceForge when a pre-alpha project describes itself
as "a framework that aims to..."
[*2] BeoStat was originally designed as a multicast system. The concept
was that each node could optionally gather and maintain its own stats
table. This turned out to be a bad idea: multicast turns unreliable just
when you need it the most, switches discard multicast traffic
unpredictably, and multicast reception results in scheduling interference.
All of this to build a table that nothing on the compute nodes cares
[*3] Recently the guys at GaTech suggested that we change from the simple
UDP destination address (which gives us a choice of unicast, multicast or
broadcast) to a list of (presumably unicast) destinations. This adds a
trivial amount of packet construction overhead and code complexity, while
giving us many of the advantage of multicast. We can have multiple head
nodes that understand the cluster, spreading the reporting workload and
Donald Becker becker at scyld.com
Penguin Computing / Scyld Software
Annapolis MD and San Francisco CA
More information about the Beowulf