[Beowulf] What services do you run on your cluster nodes?

Robert G. Brown rgb at phy.duke.edu
Tue Sep 23 03:09:36 PDT 2008

On Mon, 22 Sep 2008, Joe Landman wrote:

> Prentice Bisbal wrote:
>> The more services you run on your cluster nodes (gmond, sendmail, etc.),
>> the less performance is available for number crunching; but the more
>> services you turn off, the harder administration becomes. For example,
>> if you turn off postfix/sendmail, you'll no longer get automated e-mails
>> from your system to alert you to a problem.
> Does every node need to be running sendmail/postfix?  In most cases, nodes 
> should be fairly "dumb", in the sense of having as absolutely little as 
> possible actively running.  They largely need little more than an 
> authentication service, a login/process start service, and a disk service 
> (NFS, panfs, glusterfs, ...).

One can always run xmlsysd instead, which is a very lightweight
on-demand information service.  It costs you, basically, a socket, and
you can poll the nodes to get their current runstate every five seconds,
every thirty seconds, every minute, every five minutes.  Pick a
granularity that drops its impact on a running computation to a level
you consider tolerable, while still providing you with node-level state
information when you need it.

Just a thought...;-)
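
The polling idea can be sketched roughly as follows, in Python.  Treat
it as an illustration only: the port number, the request string, and
the helper names (fetch_state, poll_once, poll) are assumptions made
for the example, not xmlsysd's documented interface, so check them
against the daemon you actually run.

```python
import socket
import time
from typing import Callable, Dict, Iterable, Iterator, Optional

ASSUMED_PORT = 7887           # assumption: port the node daemon listens on
ASSUMED_REQUEST = b"send\n"   # assumption: request that asks for current state


def fetch_state(host: str, port: int = ASSUMED_PORT,
                request: bytes = ASSUMED_REQUEST,
                timeout: float = 2.0) -> str:
    """Open one TCP connection, send one request, return the first chunk of the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request)
        return sock.recv(65536).decode("utf-8", errors="replace")


def poll_once(nodes: Iterable[str],
              fetch: Callable[[str], str] = fetch_state) -> Dict[str, str]:
    """Ask every node once; a dead or slow node is recorded, not fatal."""
    states = {}
    for node in nodes:
        try:
            states[node] = fetch(node)
        except OSError as exc:  # covers refused connections and timeouts
            states[node] = f"unreachable: {exc}"
    return states


def poll(nodes: Iterable[str], interval_s: float = 30.0,
         rounds: Optional[int] = None,
         fetch: Callable[[str], str] = fetch_state,
         sleep: Callable[[float], None] = time.sleep) -> Iterator[Dict[str, str]]:
    """Yield one {node: state} snapshot per round, sleeping interval_s between rounds."""
    nodes = list(nodes)
    done = 0
    while rounds is None or done < rounds:
        if done:
            sleep(interval_s)  # the granularity knob: 5, 30, 60, 300 seconds...
        yield poll_once(nodes, fetch)
        done += 1
```

Something like `for snapshot in poll(["node01", "node02"], interval_s=60):`
would then hand you one snapshot a minute (the hostnames are made up).
The only cost on the node side is the socket and one reply per
interval, so raising interval_s is how you trade monitoring detail for
less disturbance of the running job.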


>> My question is this: how extreme do you go in disabling non-essential
>> services on your cluster nodes? Do you turn off *everything* that's not
>> absolutely necessary, or do you leave some things running to make
>> administration easier?
> As long as you have an ssh portal in as root, you should be fine for admin. 
> Though, from an admin point of view, as you scale up the number of nodes, you 
> want the admin load to remain constant, that is, not to scale with increasing 
> node count.  Moreover, you want to actively reduce the number of moving 
> parts, as it were, as you scale up, as moving parts tend to break.  These are 
> things like installs, or images.  We have customers who occasionally (against 
> our advice) test the limits of their "cluster installer".  What is 
> interesting is that they can't *successfully* install/image more than about 
> 20-24 nodes at a time.  Yes, they can install more than that, but the 
> systems they install that way seem to have some problems which go away at 
> the next reload.
> Basically as you scale up the system, you want to scale down, if not 
> completely eliminate, node level admin.  You definitely don't want the nodes 
> to be spending cycles (and therefore power, time, resources) on things that 
> they really ought not to spend time on.
> Joe
>> I'm curious to see how everyone else has their cluster(s) configured.

