[Beowulf] What services do you run on your cluster nodes?
becker at scyld.com
Tue Sep 23 10:03:29 PDT 2008
On Mon, 22 Sep 2008, Perry E. Metzger wrote:
> Prentice Bisbal <prentice at ias.edu> writes:
> > The more services you run on your cluster node (gmond, sendmail, etc.)
> > the less performance is available for number crunching, but at the same
> > time, administration difficulty increases. For example, if you turn off
> > postfix/sendmail, you'll no longer get automated e-mails from your
> > system to alert you to a problem.
> If a machine isn't sending out more than, say, 20,000 email
> messages an hour, you won't notice the additional load Postfix puts on
> a modern machine with any reasonable measurement tool.
> FYI, a modern box running postfix can handle millions of messages per
> hour before it starts getting into trouble.
The overall load isn't the issue, it's the scheduling interference.
If you have a dozen nodes working on a fine-grained, lock-step
computation, a node taking a millisecond off every second goes unnoticed.
If you have a few hundred nodes working on the problem, that millisecond
becomes a huge problem.
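To see why the node count matters so much, here is a back-of-envelope
model (the specific numbers — 10 ms of work per step, a 1% chance of a
1 ms daemon wakeup — are my own illustrative assumptions, not from the
post). A barrier-synchronized step runs at the speed of the slowest
node, so one delayed node delays them all:

```python
# Back-of-envelope model of OS-noise amplification under a barrier.
# All constants below are illustrative assumptions.

work_ms = 10.0   # compute time per step, per node (assumed)
delay_ms = 1.0   # cost of one daemon wakeup (assumed)
p_hit = 0.01     # per-node chance of a wakeup during a step (assumed)

def expected_step_ms(nodes):
    # Probability that at least one of the N nodes is hit this step;
    # the barrier makes the whole cluster pay for that one node.
    p_any = 1.0 - (1.0 - p_hit) ** nodes
    return work_ms + p_any * delay_ms

for n in (12, 300):
    step = expected_step_ms(n)
    slowdown = (step / work_ms - 1.0) * 100
    print(f"{n:4d} nodes: expected step {step:.3f} ms (~{slowdown:.1f}% slower)")
```

With these numbers, a dozen nodes lose about 1% to noise, while three
hundred nodes lose nearly 10% — same daemons, same per-node load, just
more chances per step for the barrier to catch someone napping.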
We recognized this effect over a decade ago. It was a motivation
when we designed the Scyld cluster system in early 2000, and was a key
point when we started talking about it back then. The effect has been
independently discovered many times, but I think ours was one of the
earliest designs built around it.
We solved the problem by using a full featured, fully-installed head
("master") node that ran all standard services, and having the rest of the
nodes be start-from-zero compute slaves that don't run anything but the
application. This is quite different from the "what can I eliminate"
mindset. Designs that start from a full install and strip it down often
eliminate too much, or fail to recognize that unused "idle" services
aren't really idle. Idle daemons frequently wake up, look around, and go back to
sleep. Look at the research that has gone into making the Linux kernel
"tick free". The focus has been on power savings rather than HPC, but
their findings provide third-party confirmation. They eliminated periodic
timer ticks, instead using a countdown timer to wake the kernel only when
needed. Except that so many things wake up, look around, and go back to
sleep that they didn't see much savings!
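You can get a rough feel for this wakeup noise from user space. The
sketch below is my own illustrative probe, not the methodology from the
post: repeatedly request a short sleep and record how far past the
deadline the process actually wakes up. On a node busy with daemons,
those wakeups and timer coalescing show up as overshoot:

```python
# Rough user-space probe of scheduling jitter (illustrative sketch).
import time

def measure_overshoot(samples=200, interval_s=0.001):
    """Sleep `interval_s` repeatedly; return per-sample overshoot in seconds."""
    overshoots = []
    for _ in range(samples):
        start = time.perf_counter()
        time.sleep(interval_s)
        elapsed = time.perf_counter() - start
        # Anything beyond the requested interval is scheduler/timer noise.
        overshoots.append(max(0.0, elapsed - interval_s))
    return overshoots

if __name__ == "__main__":
    o = sorted(measure_overshoot())
    print(f"median overshoot: {o[len(o) // 2] * 1e6:.0f} us, "
          f"worst: {o[-1] * 1e6:.0f} us")
```

The median tells you about the timer granularity; the worst case is
where the daemons and the "stacked coincidences" live.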
The secondary effects are the real cost, and they are difficult to
directly measure. Every time a daemon wakes, it kills the application's
... uhmm, "momentum". It flushes a bunch of cache lines and TLB (page-table
lookaside) entries. It might kick out a few pages and D-cache entries. These might
break up application I/O that could otherwise be coalesced into a big
request. How much time does all this cost? Well, much of the time not
very much. But occasionally the coincidences stack up and
become really expensive. Like a single driver braking in rush-hour
traffic, one delayed node stalls the whole cluster-wide app.
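A small Monte Carlo sketch makes the "cheap on average, expensive in
coincidence" point concrete. The delay distribution here is entirely my
own assumption (mostly zero, occasionally a ~50 us cache/TLB refill,
rarely a ~1 ms page eviction); the mechanism — the cluster pays the
per-step maximum, not the per-node average — is the point:

```python
# Monte Carlo sketch of coincidences stacking up under a barrier.
# The delay distribution below is an illustrative assumption.
import random

random.seed(1)  # deterministic for reproducibility

def node_delay_us():
    r = random.random()
    if r < 0.95:
        return 0.0      # no interference this step
    elif r < 0.999:
        return 50.0     # cache/TLB refill after a daemon wakeup
    else:
        return 1000.0   # rare expensive event (e.g. pages kicked out)

nodes, steps = 300, 1000
per_node, per_cluster = 0.0, 0.0
for _ in range(steps):
    delays = [node_delay_us() for _ in range(nodes)]
    per_node += sum(delays) / nodes   # what each node loses on average
    per_cluster += max(delays)        # what the lock-step app loses

print(f"avg overhead per node per step: {per_node / steps:6.1f} us")
print(f"avg overhead the cluster pays:  {per_cluster / steps:6.1f} us")
```

Each node loses only a few microseconds per step on average, but with
300 nodes some node almost always draws a bad number, so the barrier's
max is one to two orders of magnitude worse than the per-node mean.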
Next posting: how the app itself can be the cause of slow-downs, and why
cluster-specific name services and library/executable memory
"wire-downs" solve problems.
Donald Becker becker at scyld.com
Penguin Computing / Scyld Software
Annapolis MD and San Francisco CA