[Beowulf] What services do you run on your cluster nodes?
Robert G. Brown
rgb at phy.duke.edu
Tue Sep 23 03:09:36 PDT 2008
On Mon, 22 Sep 2008, Joe Landman wrote:
> Prentice Bisbal wrote:
>> The more services you run on your cluster nodes (gmond, sendmail, etc.),
>> the less performance is available for number crunching, but if you turn
>> them off, administration difficulty increases. For example, if you turn off
>> postfix/sendmail, you'll no longer get automated e-mails from your
>> system to alert you to a problem.
> Does every node need to be running sendmail/postfix? In most cases, nodes
> should be fairly "dumb", in the sense of having as absolutely little as
> possible actively running. They largely need little more than an
> authentication service, a login/process start service, a disk service (NFS,
> panfs, glusterfs, ...).
One can always run xmlsysd instead, which is a very lightweight
on-demand information service. It costs you, basically, a socket, and
you can poll the nodes to get their current runstate every five seconds,
every thirty seconds, every minute, every five minutes. Pick a
granularity that drops its impact on a running computation to a level
you consider tolerable, while still providing you with node-level state
information when you need it.
Just a thought...;-)
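To make the polling idea concrete, here is a minimal sketch in Python of
a head-node poller that opens one socket per node, asks for state, and
sleeps for whatever interval you picked. It is not xmlsysd's actual
client or protocol: the request string, the port you pass in, and the
node names are all placeholders (assumptions for illustration), and it
assumes the daemon answers a single request and then either closes the
connection or goes quiet.

```python
import socket
import time


def poll_node(host, port, request=b"send\n", timeout=2.0):
    """Open one TCP connection, send a request, and return the reply as text.

    The default request string is a hypothetical placeholder, not a
    documented xmlsysd command. Returns None if the node is unreachable.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(request)
            sock.shutdown(socket.SHUT_WR)  # tell the daemon the request is complete
            chunks = []
            while True:
                try:
                    data = sock.recv(4096)
                except socket.timeout:
                    break  # daemon went quiet; keep whatever arrived so far
                if not data:
                    break  # daemon closed the connection
                chunks.append(data)
            return b"".join(chunks).decode("utf-8", errors="replace")
    except OSError:
        return None  # refused, unresolvable, or timed out while connecting


def poll_cluster(nodes, port, interval, rounds, request=b"send\n"):
    """Poll every node once per round, sleeping `interval` seconds between rounds.

    Returns a list with one {host: reply_or_None} dict per round.
    """
    history = []
    for i in range(rounds):
        snapshot = {host: poll_node(host, port, request) for host in nodes}
        history.append(snapshot)
        if i < rounds - 1:
            time.sleep(interval)
    return history


# Hypothetical usage: node names and port number are placeholders.
# history = poll_cluster(["node01", "node02"], port=12345, interval=30, rounds=10)
```

The only knob that matters for the running computation is `interval`:
each poll costs a node one short-lived connection, so going from 5
seconds to 5 minutes cuts that load by a factor of 60 at the price of
staler state.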
>> My question is this: how extreme do you go in disabling non-essential
>> services on your cluster nodes? Do you turn off *everything* that's not
>> absolutely necessary, or do you leave some things running to make
>> administration easier?
> As long as you have an ssh portal in as root, you should be fine for admin.
> Though, from an admin point of view, as you scale up the number of nodes, you
> want the admin load to remain constant, that is, not to scale with increasing
> node count. Moreover, you want to actively reduce the number of moving
> parts, as it were, as you scale up, as moving parts tend to break. These are
> things like installs or images. We have customers who occasionally (against
> our advice) test the limits of their "cluster installer". What is
> interesting is that they can't *successfully* install/image more than about
> 20-24 nodes at a time. Yes, they can install more than that, but the
> systems they install that way seem to have some problems which go away at
> the next reload.
> Basically as you scale up the system, you want to scale down, if not
> completely eliminate, node level admin. You definitely don't want the nodes
> to be spending cycles (and therefore power, time, resources) on things that
> they really ought not to spend time on.
>> I'm curious to see how everyone else has their cluster(s) configured.
Robert G. Brown Phone(cell): 1-919-280-8443
Duke University Physics Dept, Box 90305
Durham, N.C. 27708-0305
Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977