[Beowulf] What services do you run on your cluster nodes?

Mon Sep 22 16:39:04 PDT 2008

Greg Lindahl wrote:
> On Mon, Sep 22, 2008 at 03:04:33PM -0400, Perry E. Metzger wrote:
> 
>> If a machine isn't sending out more than, say, 20,000 email
>> messages an hour, you won't notice the additional load Postfix puts on
>> a modern machine with any reasonable measurement tool.
> 
> Ah. So if I have 3,000 nodes, running an extremely tightly coupled
> app, and each one does postfix once per 30 seconds (that's
> 100/second), and each time one wakes up it causes the entire cluster
> to freeze for 1 quantum (1/250 of a second), how much work does the
> cluster get done?

Indeed.  Technologies to avoid these kinds of problems are pretty common on
large clusters, things like:
* Precisely synchronized clocks (so everyone takes the overhead at the same
  time).  SiCortex, OctigaBay/Cray, IBM Blue Gene and a few others have
  mentioned it, I believe.
* Greatly simplified microkernels; even virtual memory can be skipped.  IBM
  Blue Gene and several Crays do not run a full (or even partial) Linux kernel
  on compute nodes.
* Large pages (fewer TLB events to throw off the timing)
* bproc/Scyld: with a minimum number of processes running (kernel + 1?)
  there's less to interfere with application performance.
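
To put rough numbers on the synchronized-clocks item, here is a minimal Python
sketch using Greg's figures (3,000 nodes, one wakeup per node every 30 seconds,
a stall of 1/250 second per wakeup).  It assumes every wakeup stalls the whole
job and that unsynchronized wakeups never overlap, which is pessimistic.  The
function name lost_fraction is made up for illustration.

```python
from fractions import Fraction


def lost_fraction(nodes, wake_period_s, stall_s, synchronized):
    """Fraction of wall-clock time a lockstep job spends stalled.

    Assumption: each wakeup stalls the entire job for stall_s seconds,
    and unsynchronized wakeups never coincide (worst case).
    """
    # Synchronized nodes all stall together: one stall per period.
    # Unsynchronized nodes each cause their own separate stall.
    distinct_stalls = 1 if synchronized else nodes
    stalls_per_second = Fraction(distinct_stalls) / wake_period_s
    # A job cannot lose more than all of its time.
    return min(Fraction(1), stalls_per_second * stall_s)


QUANTUM = Fraction(1, 250)  # one scheduler quantum, from the quoted example

unsync = lost_fraction(3000, 30, QUANTUM, synchronized=False)
sync = lost_fraction(3000, 30, QUANTUM, synchronized=True)

print(f"unsynchronized: {float(unsync):.1%} of time lost")
print(f"synchronized:   {float(sync):.4%} of time lost")
```

Under those assumptions the unsynchronized case loses about 40% of wall-clock
time, while the synchronized case loses one quantum every 30 seconds, roughly
0.013%.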

Basically, when running in lockstep on small jobs, the worst-case jitter
across 32 or 64 CPUs may be a very small value; I've heard 0.5-5%.  As jobs
hit 1000s of CPUs the odds get much worse.  Without some of the above to
mitigate the problem, a reasonable fraction of your compute power can go to
waiting on the slowest of the 1000 to handle an interrupt, schedule a
different process, or tweak the TLB.
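
One way to see why the odds get worse with size is a simple probability
sketch.  It assumes each node is independently disturbed during a given
lockstep step with some small probability; the value 0.001 below is a made-up
illustration, not a measurement, and the function name is hypothetical.

```python
def p_any_node_disturbed(nodes, p_per_node):
    """Probability that at least one of `nodes` nodes is disturbed in a step.

    Assumption: disturbances on different nodes are independent.
    """
    return 1.0 - (1.0 - p_per_node) ** nodes


P_PER_NODE = 0.001  # hypothetical per-node, per-step disturbance probability

for n in (32, 64, 1000, 3000):
    delayed = p_any_node_disturbed(n, P_PER_NODE)
    print(f"{n:5d} nodes: {delayed:.1%} of steps wait on a disturbed node")
```

With that made-up probability, only a few percent of steps are delayed at 32
nodes, but well over half of them are at 1000 nodes, since a lockstep job
waits on whichever node is slowest.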

Rocks is the exact opposite, for better or worse.  Quite a few processes run
on compute nodes by default: 8 apaches, 5 gettys, 5 hald-related processes,
4-5 SGEs (one of which uses approximately 8 minutes of CPU per day), ganglia
(68 seconds per day), and a ps listing somewhere in the neighborhood of 130
processes.

So if you aren't careful you don't pay just one node's overhead on a job, but
all of them.  So the worst case with 1000 nodes would be 9 minutes * 1000 =
150 hours per day.  Without careful coordination to make sure that the
overhead happens on all nodes at the same time, you could get into the
situation where at least one node of 1000 is always doing something else.
Even that assumes that things like ganglia's greceptor and SGE's sge_execd
don't use more CPU per node on larger clusters.
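
The arithmetic behind that worst case can be written out as a short Python
sketch.  It uses only the SGE and ganglia figures above and assumes, as the
worst case does, that no two nodes ever take their overhead at the same time.
The names are made up for illustration.

```python
SGE_S_PER_DAY = 8 * 60    # "approximately 8 minutes of CPU per day"
GANGLIA_S_PER_DAY = 68    # "68 seconds per day"

# Per-node daemon overhead: 548 seconds, i.e. roughly 9 minutes per day.
per_node_s = SGE_S_PER_DAY + GANGLIA_S_PER_DAY


def worst_case_hours(nodes, per_node_seconds):
    """Total stall per day if no two nodes' overhead ever overlaps."""
    return nodes * per_node_seconds / 3600


rounded = worst_case_hours(1000, 9 * 60)     # using the rounded 9 minutes
unrounded = worst_case_hours(1000, per_node_s)

print(f"per node: {per_node_s} s/day ({per_node_s / 60:.1f} minutes)")
print(f"1000 nodes, 9 min each: {rounded:.0f} hours/day")
print(f"1000 nodes, {per_node_s} s each: {unrounded:.1f} hours/day")
```

Since the total is far more than the 24 hours in a day, overhead that never
overlapped would leave some node busy with daemon work at every moment.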

In my tests I see pretty good scaling up to 64 nodes for the workloads we run,
but I'd certainly look closer if we started running substantially larger jobs.

I suspect that Cray, IBM, and related folks have spent the blood, sweat, and
tears on the compute nodes for very good reasons.  I've seen related papers
at various SCs as well as the Ottawa Linux Symposium, but do not have any
references handy.