Top 500 trends

Robert G. Brown rgb at phy.duke.edu
Tue Nov 26 10:56:13 PST 2002


On Tue, 26 Nov 2002, Florent Calvayrac wrote:

> As always, your demonstration is coherent, fun and convincing, but you 
> forget:
> -that a cluster built of COTS parts can be transformed into a lot of 
> desktop PCs 3 years after it is built, saving 30% of the cost

Yeah, sometimes we've been able to do this.  For certain clusters
(generally mine, since my work is embarrassingly parallel (EP)) I've
even put the cluster right on the desktops from the beginning to get
dual use from them throughout, saving 100% of the desktop cost (or 50%
of the cost of each, shared between the two purposes:-).  After all,
for EP tasks, giving up the 1-2% of the CPU typically consumed by a day
of hard desktop use reading mail, websurfing, and playing mahjong is
totally painless, and it reaps numerous benefits.  This is harder to do
with rackmounts, though, and for EP tasks I can actually use the nodes
for a lot longer than 3 years if they don't break, so they don't always
make it to a desktop before they die.

> -it's maybe not worth building clusters with a thousand nodes, but one 
> with, say, 16 or 32 nodes can be very interesting at least for teaching 
> purposes, as it is way cheaper than a supercomputer. Students in 
> engineering are now routinely confronted with parallel computers when 
> on a "real job": finite-element packages or fluid dynamics codes are 
> parallelized nowadays, and for, say, an automotive company, it is 
> interesting to complete a series of numerical crash-tests in the next 
> month rather than the next decade. The same holds for drug development: 
> I am happy that IBM spends billions on "Blue Gene" if it saves 5 years 
> in finding a cure for Alzheimer's or Creutzfeldt-Jakob disease.

I'm not arguing against building cluster computers at all.  I agree
completely with what you say here.  In fact, I think that we should be
building a lot MORE of those 16-64 node clusters and a lot FEWER of the
centralized, huge, 512-1024 node clusters, and in particular I think
that the really big clusters should be subjected to a rigorous
cost-benefit analysis (CBA) before being funded.  Obviously weather,
Alzheimer's, mad cow disease, cystic fibrosis, aerospace engineering
and many other academic and corporate tasks have huge potential payoffs
and warrant big investments if/where they'll do some good.

However, huge amounts of money have historically been spent on enormous
clusters working on problems that aren't going away anytime soon, with
no discernible benefit to mankind other than the philosophical
satisfaction of knowing more about e.g. how the very, very early
universe got its start, or whether the later universe will keep going
to infinity and beyond or stop and eventually collapse again.  I mean,
I'd like to know as much as anybody else, but not even the sun will be
around when the universe finally does whatever it's going to do, so it
seems that we could postpone trying to answer some questions like this
(or at least slow down the rate at which we TRY to answer them) and
wait for Moore's Law to make answering them cheaper.

I also think that the single most important reason to support lots of
the academic and institutional clusters that are built is to support
the humans who work on them, not necessarily the science per se.  Those
humans are the ones who "invented" cluster computing, developed the
beowulf concept, and wrote and improved the parallel programming tools,
all the while training the young scientists and engineers who are doing
the really valuable work that IS being done with clusters.

One could argue that one does better at supporting this human
infrastructure by supporting "sustainable" mid-sized (8-128 node)
clusters at a lot of institutions than by buying one BIG cluster that
you will use for only 3 years before Moore's Law and hardware failure
start to crush you economically.  By sustainable I mean clusters that
you plan to rollover-replace in fractions every year, forever.  More
smaller clusters means more humans, and the humans are the key resource
and one of the key benefits, not the hardware per se.

So I reiterate -- rather than strive to build a "Top 500" computer
(or a computer whose not-horribly-covert purpose is to "take back the
number one position from the fiendish Japanese":-) one should instead
examine the TASKS those computers are intended to work on and determine
whether they warrant spending all that money right now to get the
answer right away, or whether they could do just as well plodding along
with a much lower level of annual investment, counting on Moore's Law
to carry them to the finish line at (curiously enough) about the same
time.

One of my favorite little examples is the following.  Suppose that
computer "capacity" (speed, mostly, but everything else as well)
doubles at constant cost every 12 months.  Historically it pretty
nearly does this in the mid-price-range sweet spot, and 12 months is a
more convenient number than the 16-18 months that is probably closer to
the truth.

You have (say) $50K to spend on computer hardware, total (I'll assume
that networking is "free" for this example -- perhaps 100BT suffices and
you already have a big 100BT switch).  Your budget has to pay for the
nodes themselves and any service/repair.

You can:

  a) Buy the systems all at once at the beginning.  Let's call the speed
of each system in year 1 "1".  This buys you 50 units of computing.
We'll assume that 10% of the original systems fail in each year after
the first, which is under warranty (so year 1 failures get repaired).
Thus

  (end of) year     work done in year      cumulative work
==========================================================
        1                  50                    50
        2                  45                    95
        3                  40                   135
        4                  35                   170
        5                  30                   200

or you can:

  b) Buy $10K worth of systems a year, but each NEW year the systems you
buy double in speed.  Thus

  (end of) year     work done in year      cumulative work
==========================================================
        1                  10                    10
        2            9+20= 29                    39
        3         8+18+40= 66                   105
        4      7+16+36+80=139                   244
        5  6+14+32+72+160=284                   528

Allowing for failure along the way, you break even early in the fourth
year and accomplish more than twice as much work (528 vs 200 units)
over the five years.  In fact, the new systems purchased in year five
ALONE do almost as much work (160 units) as you do in all five years
under plan a).
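
For anyone who wants to fiddle with the assumptions, here's a minimal
sketch of the arithmetic behind both tables.  The function names and
the linear 10%-of-original-count attrition model are just my shorthand
for the tables above, and the doubling_months knob is there so you can
try more conservative numbers:

# Toy model of the two purchase strategies.  Assumes each cohort of
# systems loses 10% of its ORIGINAL count per year after its first
# (warranty) year, matching the 50, 45, 40... decay in the tables.

def plan_a(units=50, years=5):
    # Spend the whole budget up front on speed-1 systems.
    total = 0.0
    for y in range(years):
        work = units * (1.0 - 0.1 * y)    # linear 10%/yr attrition
        total += work
        print("year %d: work %6.1f  cumulative %6.1f" % (y + 1, work, total))
    return total

def plan_b(per_year=10, years=5, doubling_months=12):
    # Spend 1/years of the budget each year; each new cohort runs at
    # f times the speed of the previous one (Moore's Law).
    f = 2.0 ** (12.0 / doubling_months)
    total = 0.0
    for y in range(years):
        work = 0.0
        for c in range(y + 1):            # cohorts bought in years 0..y
            surviving = per_year * (1.0 - 0.1 * (y - c))
            work += surviving * f ** c    # cohort c has speed f^c
        total += work
        print("year %d: work %6.1f  cumulative %6.1f" % (y + 1, work, total))
    return total

plan_a()                      # cumulative 200, as in the first table
plan_b()                      # cumulative 528, as in the second
plan_b(doubling_months=18)    # rollover still wins, ~301 vs 200

Even at the slower 18-month doubling, the rollover plan comes out
roughly 50% ahead, and that's before counting the power, cooling, and
maintenance costs discussed below.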

According to this argument, UNLESS YOU ARE WORKING ON A FINE-GRAINED
PARALLEL TASK -- where there is arguably an advantage to homogeneity
that might justify the first route IF the communications are fast
enough to scale your (by hypothesis fine-grained) computation to 50
units of speed 1 -- you should almost certainly adopt the second
strategy.

This is still true even if you use a more conservative 18-month
doubling time.  Moreover, the example utterly neglects the human costs
of maintenance and the costs of feeding and cooling the relatively
wasteful 50 speed-1 systems for five years, compared to feeding and
cooling just 10, 20, 30... systems, throwing away old units as they
break (after cannibalizing the dead for parts:-).  According to some
very serious CBAs sent to me by listvolken in the past, those costs
swamp strategy a) EVEN IF YOU THROW THE PREVIOUS YEAR'S SYSTEMS AWAY
after 18 months (or give them to folks for their desks:-).  Human time
is almost always more expensive than hardware, especially on this
scale.

Given this, why do we see all of these hugely expensive superclusters
being built?  It is silly as hell to spend $300 million on a cluster
all at once, period, for MOST of the tasks to which such a cluster
could be applied.  One is almost certainly better off spending $100
million a year for three years, or $60M a year for five years, than
spending $300M up front on a cluster intended to last 3 or 5 years.
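
(In the toy script above, that's e.g. plan_a(units=300, years=3)
versus plan_b(per_year=100, years=3): roughly 810 vs 1050 units of
work, and the gap only widens over five years.  The model is linear in
the budget, so the conclusion at $300M is the same as at $50K.)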

Of course, that isn't so sexy, doesn't dump as much pure profit into
major US corporate pockets to prop up the sagging tech industries, and
doesn't make the "top 500 list".  The list is a pure
my-thing-is-bigger-than-yours marketing tool, especially given that it
omits the COST of most of the systems presented, making it impossible
to answer the question "was this system in the top 500 in terms of
INTELLIGENT DESIGN AND COST EFFECTIVENESS?"

Finally, I find it amusing and enlightening that the top 500 list is
almost certainly wrong anyway, at least after the top twenty or fifty
or so.  From there on, the list is dominated by clusters and
supercomputers marketed by big hardware manufacturers, at CPU speeds
and node counts that are pretty pedestrian.  It is hard to believe that
there aren't a LOT of beowulfs out there like number 370, with 160 1.47
GHz Athlons, or number 338, with 128 1.7 GHz P4's, or Dell's number
207, with 128 P4 Xeons at 2.4 GHz.  It seems pretty likely that this is
because many of us who own a cluster that "should" be on the list don't
care enough to enter it.  Perhaps that is a sign of maturity and
security?  Perhaps it is because we care more about cost/benefit in
application to our own problems than about "linpack performance" as a
measure of "topness"?  Perhaps it is because we have jobs (our own, and
jobs to run on the cluster) that are more important than running a
basically useless linpack benchmark and going through the hassle of
entering the race?

     rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email: rgb at phy.duke.edu