[Beowulf] 512 nodes Myrinet cluster Challenges

Robert G. Brown rgb at phy.duke.edu
Mon May 1 17:50:54 PDT 2006

On Mon, 1 May 2006, David Kewley wrote:

>>> For clusters with more than perhaps 16 nodes, or EVEN 32 if you're
>>> feeling masochistic and inclined to heartache:
>> with all respect to rgb, I don't think size is a primary factor in
>> cluster building/maintaining/etc effort.  certainly it does eventually

Aw, Mark, you ARE joking, aren't you?  C'mon...;-)

Sizes in general, along with rates and maybe prices, are the ONLY
factors that enter into quantitative cluster engineering, and the size
of the cluster (number of nodes) in particular is the prime variable in
all sorts of areas of cluster design from Amdahlian scaling laws to
engineering of the spaces designed to handle the clusters.  Many of
which are fundamentally nonlinear relationships.
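As a rough illustration of one such nonlinear relationship, here is a minimal sketch of the plain Amdahl's-law speedup as a function of node count.  The 5% serial fraction is an assumed, purely illustrative number, and the formula ignores communication overhead, which in practice makes things worse at large node counts.

```python
def amdahl_speedup(serial_fraction: float, nodes: int) -> float:
    """Ideal Amdahl speedup on `nodes` processors for a given serial fraction."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if nodes < 1:
        raise ValueError("nodes must be >= 1")
    parallel_fraction = 1.0 - serial_fraction
    # Serial part runs at full cost; parallel part is divided across nodes.
    return 1.0 / (serial_fraction + parallel_fraction / nodes)


if __name__ == "__main__":
    serial_fraction = 0.05  # assumed value, for illustration only
    for nodes in (16, 32, 512, 1600):
        s = amdahl_speedup(serial_fraction, nodes)
        print(f"{nodes:5d} nodes: speedup {s:6.2f}, efficiency {s / nodes:6.1%}")
```

With a 5% serial fraction the speedup can never exceed 20 no matter how many nodes are added, so per-node efficiency falls steadily as the cluster grows.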

Of COURSE there are different issues one has to confront building a
cluster with 16 nodes compared to building one with 1600 nodes (other
than aggregate MTBF, which I agree certainly is an issue).  Like
building your own small power plant and having AC units the size of a
tractor trailer outside a warehouse-sized space with carefully delivered
power and cooling vs sticking them in the office down the hall that
happens to be first in the AC delivery system and everybody thinks it is
too cold anyway.  Like ensuring that your IPC system can enable
acceptable speedup on the task the cluster is designed to do.  Like

>> become a concern, but that's primarily a statistical result of
>> MTBF/nnodes.  it's quite possible to choose hardware to maximize MTBF and
>> configuration risk.

Ah, I remember well my halcyon days when I too truly believed this.  I
also well remember the Tyan motherboards and in a SEPARATE incident
Taiwanese capacitors that permanently changed my mind, scarring my
psyche deeply in the process.  The point is ultimately that there is a
nasty nonlinearity in the impact of a "catastrophe" -- one not
necessarily linked to how carefully you pick your hardware -- where a
happy resolution ultimately depends on who pays to fix it and how
rapidly it gets fixed if it doesn't work out.
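For contrast with the catastrophe case, the MTBF/nnodes arithmetic in the quoted remark can be sketched as follows.  This assumes nodes fail independently at a constant rate, which is exactly the assumption a bad batch of boards violates, and the 50,000-hour per-node MTBF is a made-up illustrative figure, not a measurement.

```python
HOURS_PER_YEAR = 24 * 365


def aggregate_mtbf_hours(node_mtbf_hours: float, nnodes: int) -> float:
    """Mean time between failures anywhere in the cluster.

    Assumes independent nodes with a constant failure rate.
    """
    if nnodes < 1:
        raise ValueError("nnodes must be >= 1")
    return node_mtbf_hours / nnodes


def expected_failures_per_year(node_mtbf_hours: float, nnodes: int) -> float:
    """Expected number of node failures per year under the same assumptions."""
    return HOURS_PER_YEAR / aggregate_mtbf_hours(node_mtbf_hours, nnodes)


if __name__ == "__main__":
    node_mtbf = 50_000.0  # hypothetical per-node MTBF in hours
    for nnodes in (16, 32, 512, 1600):
        mtbf = aggregate_mtbf_hours(node_mtbf, nnodes)
        per_year = expected_failures_per_year(node_mtbf, nnodes)
        print(f"{nnodes:5d} nodes: one failure every {mtbf:8.1f} h, "
              f"about {per_year:6.1f} per year")
```

Under these assumptions a 512 node cluster sees a node failure roughly every four days.  That is a steady trickle one can plan for; it says nothing about the case where every board dies in the same month.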

So sure, it is possible to choose hardware wisely.  Or not.  If you DO
choose it wisely, it is still quite possible that it breaks 13 months in
when everything is out of manufacturer's warranty, maybe even ALL of it
breaks quite rapidly (as in fact happened with the infamous Taiwanese
Capacitor, something that affected "good" and "bad" motherboards alike
as far as I could tell -- mine were certainly fine and worked well right
up to the minute they blew capacitor sputum all over the inside of my

Then the issue is "who's going to pay to fix it, especially when fixing
it will cost a significant fraction of what the cluster cost in the first

If you've bought commercial nodes with 4 year contracts, the answer is
"they are", and the cost is a few days of downtime and a bitterly
cursing vendor; if anything, your users are impressed with your
foresight in making service a part of the up-front cost of the
nodes.  If the answer is "we are", well, that is a bad answer I can tell
you, unless you happen to have a whole bunch of money you've carefully
reserved to pay for the hardware required, and even then it is STILL a
bad answer because of all the work required, and people hate you in the

Of course this is true for large and small clusters alike.  The only
advantage of a small cluster is that it is a lot more likely that you
have 16x$100 "lying around" to fix 16 nodes, and replacing 16
motherboards might take a day of human time for somebody armed with an
electric screwdriver and a pretty clear idea of how to recable.  1-2
days for a single person and a "miscellaneous" class repair budget was
the basis for my estimate of 16-32 nodes or less.
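The back-of-envelope numbers above can be turned into a small sketch.  The $100 per board and the rate of 16 boards per person-day are read straight off the estimate in the text; treating both as constant across cluster sizes is a simplifying assumption, and the function name is just for illustration.

```python
def repair_estimate(nnodes: int,
                    part_cost_usd: float = 100.0,
                    boards_per_person_day: float = 16.0):
    """Return (parts cost in USD, person-days of labor) to swap one board per node."""
    if nnodes < 0:
        raise ValueError("nnodes must be non-negative")
    parts_cost = nnodes * part_cost_usd
    person_days = nnodes / boards_per_person_day
    return parts_cost, person_days


if __name__ == "__main__":
    for nnodes in (16, 32, 512, 1600):
        cost, days = repair_estimate(nnodes)
        print(f"{nnodes:5d} nodes: ${cost:>9,.0f} in parts, "
              f"{days:6.1f} person-days")
```

At 512 nodes the same assumptions give $51,200 in parts and 32 person-days of labor, which is no longer something that comes out of a "miscellaneous" budget.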

So it's really just an issue of insuring the risk more than MTBF per se.
Insurance companies exist because there are bad things that can happen
that in aggregate are very cheap but when they happen to you they are
very expensive.  It is easy enough to budget for insurance in the form
of service contracts -- you just build it into the cost estimates of
your cluster in the first place.  But how can you budget adequately for
disasters that you cannot predict, because if you could predict them
you'd avoid them?
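A minimal sketch of that gap between the aggregate cost and the cost when it happens to you.  Every number here (cluster price, probability of a fleet-wide failure, repair cost as a fraction of the purchase price) is a hypothetical placeholder; the point is only the shape of the comparison.

```python
def self_insurance_outcomes(cluster_cost_usd: float,
                            p_catastrophe: float,
                            repair_fraction: float):
    """Return (expected loss, loss if the catastrophe actually happens).

    p_catastrophe is the assumed probability of a fleet-wide failure over
    the cluster's lifetime; repair_fraction is the assumed repair cost as
    a fraction of the original cluster cost.
    """
    if not 0.0 <= p_catastrophe <= 1.0:
        raise ValueError("p_catastrophe must be in [0, 1]")
    repair_cost = cluster_cost_usd * repair_fraction
    expected_loss = p_catastrophe * repair_cost
    return expected_loss, repair_cost


if __name__ == "__main__":
    # All three inputs are hypothetical, chosen only to show the gap.
    expected, worst = self_insurance_outcomes(
        cluster_cost_usd=500_000.0, p_catastrophe=0.05, repair_fraction=0.3)
    print(f"expected loss averaged over many clusters: ${expected:,.0f}")
    print(f"loss if it happens to your cluster:        ${worst:,.0f}")
```

Averaged over many clusters the expected loss is small, which is why a vendor can price a service contract sensibly.  A single self-insured site either pays nothing or pays the whole repair bill, and it is the second outcome that has to be budgeted for.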

And yet sure, you're right, there is a matter of judgement here and in
addition to very small clusters, for very large clusters and projects
with a large ongoing cash flow one can achieve both the numbers required
to make self-insurance viable and the cash flow to make it possible
without holding out part of one's budget (effectively defeating the
point of self-insuring).

So let's compromise with a standard Caveat Engineer and YMMV.

> Ah, so my opinion is midway between Mark's & RGB's.  A very nice place to
> sit. :)

Yeah, but don't forget, the guy in the middle has to buy the beer.
Right, Mark?


Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
