[Beowulf] Register article on Opteron - disagree

Robert G. Brown rgb at phy.duke.edu
Mon Nov 22 09:59:03 PST 2004


On Mon, 22 Nov 2004, Patrick Geoffray wrote:

Warning and Disclaimer:  The following is a Rant, a genuine rgb Rant,
and hence represents a significant investment of time just to read
(imagine what it took me to write:-).

The faint of heart should quit now (and get some work done today...;-)

<rant>
> Hi Robert,
> 
> No, it's a statistical study. The Top500 list is relevant for what it 
> was designed for: tracking the evolution of the HPC market, ie where 
> people put their money.
> 
> Don't look at the top 10, don't look at the order, look at the 500 
> entries as a snapshot of the known 500 biggest machines. In this 
> context, you are looking for the trends: vector vs scalar, SMP versus 
> MPP, industry vs academia, etc.
> 
> Sure the vendors pay attention to it, and it's certainly a marketing 
> fight for the top10, but they cannot really influence the whole list, IMHO.

Ah, but for whom is this snapshot useful?  Who is, as you put it,
looking?  To me not at all -- my clusters have never intentionally
resembled or been influenced in any way by anything on the top 500 list.
To most of the people on this list, it is of very little use, again,
even including those seeking to build a fairly advanced and powerful
cluster or those who have built one that is represented there.  It
includes too little detail to support cluster design decisions or even
to properly represent the considerable work that went into e.g. ASCI
Red.  It isn't just the top 10 that has to be ignored to get down to
where the clusters are relevant to the needs or means of, say, a
University based research group, or department, or even centralized
computer cluster resources for an entire University -- it is almost the
entire list, or rather, even where the clusters MIGHT be relevant to a
given design goal, you cannot tell from anything on the list.

The list is used, I suspect, primarily by groups like these:  groups,
especially corporate groups, seeking to purchase a turnkey supercomputer
cluster (following one of a few more or less accepted designs) just as
once they would have bought a turnkey supercomputer.  Vendors of
hardware promoted by the list, to see what the competition is selling
(and by "vendors" I mean stockholders, the board, sales and management,
and design and engineering, each for their own purposes).  Grant
seekers, granting agencies, and grant reviewers (often the more ignorant
of the above) concerned with the >>prestige<< associated with membership
in the top500.  People in general whose mouths might actually utter "OK,
so I'm about to spend all this money on a cluster -- just where does
that put me in the top 500?"  (Something I've actually heard more than
once, usually without a word about what the cluster is actually designed
to >>do<<.  Even from people that I >>know<< know better.)

Of these groups, the one whose needs are best served by the list are the
vendors, by far, as it is the perfect marketing tool for high margin
sales backed by remarkably little design effort.  The fact that there
ARE so many duplicate clusters is clear evidence that it has become an
ideal vehicle for selling a class of boilerplate, turnkey cluster,
especially in the middle to lower end of the range.  "Buy our 'over the
counter' cluster (tuned, of course, to your budget range and needs
within reason) and we'll guarantee a top 500 listing, which can't hurt
with the funding agency, with the board, with the many people who
control the spending of the money and who seek public exposure and
prestige for marketing reasons of their own but who DON'T know enough to
actually judge competing designs on intelligent grounds, eh?"

> >   a) It lists identical hardware configurations as many times as they
> > are submitted.  Thus "Geotrace" lists positions 109 through 114 with
> > identical hardware.  If one ran uniq on the list, it would probably end
> > up being the top 200.  At most.  Arguably 100, since some clusters
> > differ at most by the number of processors.  What's the point, if the
> > site's purpose is to encourage and display alternative engineering?
> > None at all.
> 
> When you take a snapshot, you take everybody on the picture, even the 
> N-tuples. If "Geotrace" has indeed 15 identical clusters (it is quite 
> frequent in industry to have several identical clusters BTW), they 
> should be on the list, otherwise your snapshot of the market is wrong.
> 
> Why would you expect *all* of the entries to be different ?

Who cares about a "snapshot of the market" except marketers or
economists?  Is this "computer science" and I missed it?  As I noted in
the last installment of the rant and up above, the people who most care
all have something to sell, and to sell to a relatively ignorant
marketplace, with basically no exception.  

Vendors to a class of client whose decision-making process is influenced
by the simple and predictable metric of top 500 position, vendor
management groups seeking to sell their strategy to boards whose
approval is influenced by the number of top 500 clusters sold (and ditto
vendor sales departments seeking to sell their sales strategy and
performance to middle management and ditto vendor boards seeking to sell
their strategy and performance to shareholders).  Clients seeking to
sell the clusters they want to buy to funding entities that are capable
of understanding how many "megaflops" (or gigaflops, or teraflops) they
are buying and the prestige associated with being on the "in" list where they
are utterly incapable of actually judging a cluster's design,
suitability for any given purpose, or comparative cost-benefit. Even the
"bragging rights" are a form of self-sales, self-deception, to the
extend that they distract one from a proper design process.  It's
politics, not technology, driven.

If the list were actually going to be >>useful<< to thoughtful
engineers, would-be cluster purchasers or even would-be sellers of
clusters OUTSIDE of marketing and the marketplace, if it were to be
TECHNICALLY useful, then obviously compressing the identical designs
down to a single entry and presuming that all clusters built according
to that design will have identical performance according to this metric
(perhaps along with a count) makes tremendous sense.  Information
theory.  It permits useful TECHNICAL information to be communicated far
more efficiently, at the expense of sales demographics information,
which is primarily useful to sales persons and weak-minded clients. 
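
Just to be concrete about what "running uniq" on the list would mean,
here is a rough sketch (in python, with the entry fields invented out
of whole cloth -- they are NOT the actual top500 schema):

  # Rough sketch: collapse identical cluster designs into one entry plus
  # a count.  The field names are invented for illustration only.
  from collections import Counter

  entries = [
      {"vendor": "Acme", "cpu": "Xeon 3.06", "nodes": 256, "net": "GigE"},
      {"vendor": "Acme", "cpu": "Xeon 3.06", "nodes": 256, "net": "GigE"},
      {"vendor": "Bogus", "cpu": "Opteron 248", "nodes": 128, "net": "Myrinet"},
  ]

  def design_key(e):
      # Everything that defines the *design*; the site name is deliberately
      # left out so that N identical purchases collapse to one line.
      return (e["vendor"], e["cpu"], e["nodes"], e["net"])

  for design, n in Counter(design_key(e) for e in entries).items():
      print(n, "x", design)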

> >   b) It focusses on a single benchmark useful only to a (small) fraction
> > of all HPC applications.  I mean, c'mon.  Linpack?  Have we learned
> > >>nothing<< from the last twenty years of developing benchmarks?  Don't
> 
> Haven't you learn the rule #1 of benchmarking ?
> Rule #1: There is no set of benchmarks that is representative enough.

I write benchmarking tools.  Perhaps badly (as Greg has pointed out, it
is Not Easy to write a really good benchmark:-) but well enough to have
learned all sorts of rules about benchmarking and to write about those
rules frequently on this list.

What you are calling rule #1 is more commonly phrased the other way
around:  The only truly reliable benchmark is your own application.

This presupposes that your purpose is to use benchmarks as some sort of
universal metric from which one can PREDICT performance on your
application.  The truth is, one often CAN use benchmarks for this
purpose but it is an expert-friendly process and not one that can --
usually or generally -- be reduced to a single number like "number of
Linpack FLOPS".  It presupposes a fairly deep knowledge of both the
application itself and where and how it spends time and what its
bottlenecks are, and the tools used to provide direct MICRObenchmark
peeks right at the relevant performance dimensions.

Have we not just spent a week and a half or thereabouts discussing
networking hardware that provides bleeding edge latency, distinct
topology-related performance scaling patterns, at costs that can
actually be compared (to a degree of approximation)?  >>This<< is what
is useful (to some, though not to me in particular at the moment; it
is but one of many non-universal but very useful performance
metrics), and is
one of many reasons that this mailing list is a critical resource to
real cluster designers and users where the top 500 club -- I mean "list"
-- is not.

Aside from a brief moment where vendors participating in the discussion
got carried away with dueling Shameless Plugs (which was, by the way,
annoying and clearly inappropriate for the list which is NOT to be used
as a marketing vehicle except indirectly, where you can let your
microbenchmark numbers and over the counter costs speak for themselves,
or at least be spoken for by actual users) this was a really, really
lovely discussion of the highest end networking technologies and bound
to be of real use to lots of people seeking to optimize cost-benefit for
actual cluster designs with known communication needs.  It was very
informative for others who DON'T have those needs -- it is very
interesting to hear topology discussed on the list once again, as there
was a period where it was only rarely considered as a design choice.

And don't get me wrong -- vendors, especially including yourself,
participated >>well<< in this discussion for the most part -- vendors
are welcome on this list BECAUSE who could know their hardware (or
supporting software) better?  None of us, however, want the list to
deteriorate into a de facto spamfest where marketing noise starts to
outweigh signal; even the list host, Scyld, exercises extreme
discretion about using the list to plug product without being openly
solicited to do so.

> Linpack is a yard stick, that measures 3 things:
> 1) the system compute mostly right.
> 2) the system stays up long enough to be useful.
> 3) the system is big enough to be in the snapshot.
> 
> For these requirements, Linpack is just fine.

What do any of these mean?  "Mostly right"?  You mean linpack can get
wrong answers on a system that isn't openly incompetent in its basic
engineering or that linpack validates that the cluster engineers in the
top 500 aren't openly incompetent?  How does keeping a brand new cluster
up long enough for the seller (not even the client, but often the
seller!) to complete a single benchmark demonstrate that it will still
be running at a productive level in six weeks let alone six months?
"Look, the cluster we sold you actually survived its vendor burn-in and
completed linpack without crashing" is, I suppose, useful information
(considering the alternative:-), but the real hardware burn-in is the
next four to six weeks AFTER delivery, and the real test of the
cluster's long term reliability is whether the seller goes bankrupt
fulfilling the service contract or whether problems cost so much of the
user's productivity that it is severely compromised.  Which requires
looking at the cluster 1, 6, 12, 24 months down time's stream...which
WOULD be a really useful thing to publish, wouldn't it?  But it isn't on
the top500 list -- if half of the clusters there were broken nasty time
sinks that had their administrators cursing and losing hair we'd never
know it -- from the list data.

The final measurement is obviously of no TECHNICAL use at all, however
useful it is to marketers and however much it is the point of the whole
list. Pissing contest is just about right.

> I will tell you why the vendors would hate this: it takes a lot of time, 
> and they are not paid for this. Tuning Linpack on large machine is 
> already very time consuming. You want 10 more benchmarks to run ? Sure, 
> who will run them ?
> 
> BTW, how do you cook up a cluster to run Linpack well ?

I hope that you see the contradiction inherent in these two paragraphs.

So on the one hand, vendors would hate having to provide real benchmarks
because a) they would have to run them (for the most incapable or
ignorant of their clients) in order to be able to provide them; b) they
would have to make sure that their performance on the benchmarks is
good, or they'll have to actually adjust prices to reflect their
performance measurements; c) tuning a cluster design to optimize
performance on any single benchmark or metric IS possible and in fact is
SOP, but it is also expert-friendly, time-consuming and unavoidable,
since running a simple test untuned might give Bad Numbers (see b).

Most vendors I've talked to, BTW, are actually pretty happy to provide
you with benchmark results, where they can, and provide access to
systems where you can create your own where they cannot.  It gives them
an opportunity to convince you that they are worth the money, and most
vendors are proud of their products and really feel that they are a good
value and worth selecting on the basis of price/performance.  

And finally, benchmarks ARE paid for (as is everything else) --
ultimately by the consumer.  Which one would I rather have a company
provide me -- a glossy magazine advertisement showing racks of neat
boxes and smiling geekoid administrators with their feet up on a desk,
or hard benchmarks and other data on the actual PERFORMANCE and
RELIABILITY of that same cluster?  Hmmm....

Then you ask me how to cook up a cluster to run Linpack well?  The same
way one cooks up a cluster to run ANYTHING well.  By analyzing the
application, determining the bottlenecks, and investing the money spent
on the cluster (including the de facto tuning process, which vendors
SELL their clients, after all) in a balanced design.

The problem is that a cluster design tuned to run Linpack well may be
horribly balanced for running a distributed Monte Carlo application
well, or a genomics application well, or a large scale cosmology
simulation well, and so top 500 ranking is misleading even where no
tuning occurs (and I'll guess that tuning always occurs, at least a
bit).  If one is running an application for which 100 BT communications
is more than adequate, spending money to equip every node with Myrinet
(not to pick on any single high-end network, you understand;-) is silly,
I'm sure you'll admit.  However, it is quite possible that a really big
cluster built on 100 BT with an impressive aggregate compute capacity by
many measures might lose out to a much smaller cluster with a more
linpack-friendly balance between CPU and IPC.  The fact that it costs
1/3 as much for the aggregate CPU it provides to ITS primary application
is also somehow lost.
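
To put some numbers on that (purely hypothetical prices, invented only
to illustrate the distortion, not quotes from anybody):

  # Purely hypothetical prices, to illustrate how a fixed budget trades
  # raw node count against a linpack-friendly interconnect.
  budget = 300000.0        # dollars
  node_price = 2000.0      # CPU + memory + 100 BT NIC, hypothetical
  hsn_price = 1500.0       # per-node cost of a high speed network, hypothetical

  cheap_nodes = int(budget // node_price)                 # 150 nodes on 100 BT
  fancy_nodes = int(budget // (node_price + hsn_price))   # 85 nodes with the HSN

  print("100 BT cluster:", cheap_nodes, "nodes of raw CPU for the EP code")
  print("HSN cluster:   ", fancy_nodes, "nodes, but a better linpack number")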

Similarly, MANY clusters that are absolutely perfect for their
particular application that DO use very high end communications but are
truly IPC bound at (say) 64 nodes may be literally the fastest clusters
in the world at their particular class of application -- top >>1<<, not
just top 500 -- but still fail completely to make the top 500 at all.
How can these clusters even be defended?  "Oh, so you built the world's
fastest cluster, did you -- how come it isn't on the top 500 list,
then..."

It's especially amusing that you'd try to defend Linpack in particular,
as it has a long and checkered history of vendor tuning and
intervention.  Yes, it is rehabilitated and probably quite reliable at
this point, but history has already shown that vendors WILL bend designs
and skew results to make sales based on single metrics.

lmbench, on the other hand, is very easy to build on a system, very easy
to run (well, not openly DIFFICULT to build and run;-) and produces a
standardized report that can be directly compared across systems to
obtain meaningful information about how fast those systems perform a
wide range of relevant micro-scale operations.  stream is similarly
directly useful for a wide class of programs.  So are netpipe and
netperf, the former in the direct context of selected IPC mechanisms.  I
suppose one >>can<< tune for parts of these benchmarks, but generally
only at the hardware design level or driver design level, and the design
improvements are (at that level) likely to actually DELIVER their
benefits to all folks for whom the microbenchmark is relevant (those
whose code is significantly bottlenecked by the measure and who use the
improved hardware or drivers).
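
The kind of apples-to-apples comparison this permits is almost
embarrassingly simple once the numbers exist; a sketch (the numbers are
placeholders, NOT measurements of any real hardware):

  # Sketch of the comparison the microbenchmarks make possible.  In
  # practice the numbers come from lmbench, stream and netpipe/netperf
  # runs; the values below are invented placeholders.
  systems = {
      "node_type_A": {"stream_triad_MBps": 2400, "tcp_lat_us": 60, "tcp_bw_MBps": 110},
      "node_type_B": {"stream_triad_MBps": 3100, "tcp_lat_us": 45, "tcp_bw_MBps": 117},
  }

  for name, m in systems.items():
      print("%-12s triad %5d MB/s   latency %3d us   bandwidth %4d MB/s"
            % (name, m["stream_triad_MBps"], m["tcp_lat_us"], m["tcp_bw_MBps"]))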

That's WHY we can have discussions about network latency on this list.
There are tools to measure it, and we can agree to use the same tool and
we can agree that it provides a FAIR measure of performance that is
RELEVANT to certain classes of application and affects overall
performance in predictable, if complex, ways.  When was the last time
somebody had a serious discussion about linpack performance on list?  It
happens periodically, but isn't terribly interesting when it does as
there are generally better measures for much of what it collectively
tests.

> 
> >   c) It totally neglects the cost of the clusters.  If you have to ask,
> 
> What is the price of a machine ? List price ? Including software ? 
> machine room ? service ? support ? how long the support ?
> You would need to define a benchmark to get comparable prices, and trust 
> the vendors to comply with it.

Actually, I think that it would be remarkably easy to assemble a top
500-like list that would be actually useful, including both a suite of
benchmark results for a variety of relevant cluster metrics and the
prices paid, or to morph the existing top 500 list into a form that
would be useful to somebody
besides marketers and their all-too-willing prey (ooo, did I say
that;-).  Yes, most of the changes would be anathema to marketers and
certain vendors, who rely on the NON-flatness of the playing field to
harvest margin.

Interestingly, I really think that vendors WOULD benefit from this as
much as anybody, just as my own experience is that most vendors are
happy to provide real benchmark results anyway (or the opportunity to
roll your own).  Real business performance is linked to clear,
everybody-agrees value delivered for the money spent.  You want your
customers to be happy a year, two years, four years down the road, which
won't be the case if they come to realize that the design you sold them
was a total rip-off.  One can sell a price-performance loser for a while
on the basis of reputation or the ignorance of one's clientele, but in
the long run this is likely to be a losing strategy.  At worst having a
balanced-performance multi-metric top-500 (with different categories of
membership and search tools to flatten across categories) would level
the field and encourage the marketplace itself to determine fair prices
between the various hardware alternatives.  Competition is good, as long
as the "fitness" used in the marketing genetic algorithm isn't a heavily
skewed function relative to real fitness.

> > best the cost after the vendor discounted the system heavily for the
> > advertising benefit of getting a system into the top whatever.
> 
> You mean I cannot really buy the VT cluster for $5M dollars ? What a 
> pity :-)

I was thinking of this very cluster while writing much of these rants,
of course.  If ever there was a cluster built for the wrong reasons, in
the wrong way, this has to be right at the top of the "Kids, Don't Try
This at Home" cluster list.  Build it once (for a fortune) and it sucks.
Rebuild it (for a second fortune) and it doesn't suck, but nobody has a
particular use for it or funding stream for it, and it requires its own
small power plant just to keep it running.  Maybe things are better by
this point, but that's the impression I've gotten of the project, at any
rate.

However, there is little doubt (in MY mind) that the mindset that
produced this cluster just to piss a bit further than the next one and
attract media attention and to establish a certain platform as a viable
cluster base for purely marketing purposes is rife within the top 500,
and only thinly disguised at that.

And it works.  Look at the number of Power-based systems appearing in
the Top 500.  Vendor clout matters, and membership in "the club" is
important and useful to those that can afford it.

> > I could go on.  I mean, look at the banner ads on the site.  Vendors
> > love this site.  If it didn't exist, they'd go and invent it.
> 
> Who is paying for the website, the bandwidth, the booth at SC, the 
> operational expenses ? The banner ads...

Who is, and who should be, are very useful questions.  "Who is" is likely
a mix of the vendors and the granting agencies that support the
sponsoring institutions and those institutions themselves (out of
opportunity cost time if nothing else in the latter case).  I wouldn't
be surprised if there was grant support openly behind the project; there
almost certainly is grant support behind the groups/individuals that do
the actual work of running it even if that support is nominally for
doing something else (common enough in academia).

Who should be is indeed the major funding agencies, either singly or
together.  Let's face it -- the website and the bandwidth are likely
nearly irrelevantly small expenses.  A database (a small and non-complex
one at that).  A straightforward lookup tool and associated links.  I've
built sites like this myself in my copious free time, and that may well
be how it is run this very day -- a good design and it runs itself,
mostly. A cheap server on a network with decent backbone bandwidth and
spare capacity (most universities and government labs would qualify).

Hiring a fraction of a full-time person to run the website and
spending (say) $10K/year on hardware might cost $50K or even $100K,
including overhead.  To provide a >>useful resource<< to support all the
many, many would be cluster users seeking NSF funding to BUY those
clusters would be money well-spent and would result in saving 10 to 50
times the investment (at a guess) within the NSF alone. IF the
information provided by the resource was actually (technically) useful!

> There are no such things as "turnkey and mass produced" clusters, except 
> when a customer buy several instances of the same configuration. Most 
> configs are different.

There is "different" and "significantly different", where I don't view
systems as being "different" unless the difference is in one of the
primary design metrics actually exercised by the test being used.

However, so fine, I agree.  Everything is different, even if one is
getting a Western Scientific Special (just one big enough to get on the
top 500 list). Then let's CELEBRATE the difference, let's RECORD the
difference.  Instead of a "top 500" site that celebrates a single metric
and a system description that is little more than a vendor name and a
number of processors (giving a bit of a lie to this assertion, by the
way) let's shoot for a "Cluster Registry" site where folks can
"register" their clusters in some detail.  Include a lot more detail --
CPU, clock, motherboard, memory, network(s), disk(s), network topology
and yes, actual price paid for the cluster itself, component wise or in
aggregate, exclusive of any software beyond the operating system (where
relevant) and general purpose programming tools and libraries, e.g. the
compiler (listed as separate entities if possible).

Then run a standardized suite of benchmark tools on it to get an ARRAY
of performance metrics, and price metrics, many of them microbenchmark
results (and hence difficult to "tune" as they measure properties of the
actual hardware architecture, operating system, and compiler chosen,
along with critical libraries along e.g. network communications paths
where relevant).  Store all of these in a nice, extensible database.
Insert a nice, powerful search engine that can look up and sort results
according to nearly any specified range of design and expenditure
metrics, and yes, you'd have a site that would be very, very useful.

ESPECIALLY if you included two Very Important Fields:  Owner comments
(where they have the opportunity to bash the vendor if the pile of PC's
they are sold turns out to be a pile of rubbish six months later) and
owner contact information.  If I >>did<< see a cluster in such a
registry that had good price/performance in the spectrum of metrics that
is relevant to my application(s), I'd very definitely love to have an
email address so I could ask the owner if they are still satisfied and
if there are any gotchas I should engineer out if possible.
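
For what it's worth, the guts of such a registry would be tiny; a rough
sketch in python+sqlite (table and field names are my own invention,
not anybody's actual schema) of the kind of thing I mean:

  # Rough sketch of a "Cluster Registry" schema along the lines described
  # above.  Table and field names are invented for illustration.
  import sqlite3

  db = sqlite3.connect("registry.db")
  db.executescript("""
  CREATE TABLE IF NOT EXISTS cluster (
      id INTEGER PRIMARY KEY,
      cpu TEXT, clock_mhz INTEGER, motherboard TEXT, memory_gb REAL,
      network TEXT, topology TEXT, disk TEXT, nodes INTEGER,
      price_usd REAL,               -- hardware plus OS only
      owner_contact TEXT, owner_comments TEXT
  );
  CREATE TABLE IF NOT EXISTS benchmark (
      cluster_id INTEGER REFERENCES cluster(id),
      name TEXT,                    -- e.g. 'stream_triad', 'lat_tcp', 'linpack'
      value REAL, units TEXT
  );
  """)

  # The "search engine" is then mostly a query, e.g. stream triad per dollar:
  query = """
  SELECT c.id, c.cpu, c.network, b.value / c.price_usd AS triad_per_dollar
    FROM cluster c JOIN benchmark b ON b.cluster_id = c.id
   WHERE b.name = 'stream_triad'
   ORDER BY triad_per_dollar DESC;
  """
  for row in db.execute(query):
      print(row)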

We Could Do This.  Or at least, it could be done.  I once attempted to
do it for Duke (and ended up with a php/mysql based site that "worked"
so I >>know<< it can be done) but it does require a bit more than the
time a single, heavily overworked physicist who is learning php and
mysql WHILE writing the site can contribute without actually getting
paid for it...

There was a time back at the very beginning when the beowulf website
provided a service a bit like this, but yes, it is difficult to get
folks to register their clusters without some visible benefit.  Maybe
that's the miracle of the top 500 list -- it is useful enough to vendors
that they ARE willing to register the cluster more or less for you just
to get that market visibility, and some data on a single metric is
better than none at all.  Sigh.

> Let me say it again: the Top 500 is not designed to find the best 
> machine for your needs ! That the job of a RFP and the associated suite 
> of acceptance tests !!!
> 
> when you want to spend enough dollars to buy you a 1 TFlops machine, you 
> do your homework: you identify the set of benchmarks that you have 
> confidence in, you ask vendors to bid, you choose one and you test it 
> for acceptance. You don't tape the Top500 list to the wall and throw a 
> needle. If you do, I have this nice bridge to sell :-)

While I agree, I think there is a school in Virginia that is in the
market for just such a bridge, and that while yes the process you
describe is the ideal, the RFPs I've read from actual granting agencies
seeking to give away large sums of money are not infrequently seeking
linpack MFLOPs or (more recently) Spec* numbers in lieu of actual tuned
design metrics.

I'm idealist enough to hope that in the end good engineering does win
out, and that the groups that DO identify the actual performance
bottlenecks and spend the money optimally are rewarded with the actual
grants.  Oops, pardon me, I've got to go polish my rose-colored glasses
with Adam Smith's invisible hands -- they fogged up from my tears of
idealist joy...:-)

> > useful information about things like latency, bandwidth, interconnect
> > saturation for various communications patterns, speed for a variety of
> > actual applications including several with very different
> > computation/communication patterns (ranging from embarrassingly parallel
> > to fine grained synchronous).  Scaling CURVES (rather than a single
> > silly number) would be really useful.
> 
> Sure, and you do it all again every 6 months.
> 
> Seriously, take something like bandwidth: it is pipelined or not ? 
> cold-start or not ? Back-to-back or through a switch ? how many stages 
> in the switch ? do you count a megabits as 1000000 bits or 1048576 bits 
> ? Let's imagine the mess when we talk about communication patterns...

What mess.  That's what we HAVE been talking about on this list.  It is
what this list DOES talk about.  And yes, you do it all again, and
again, and again as systems evolve and new technologies emerge and
become cost-competitive with the old, which is why Myricom cannot rely
on its market kudos of years past to guarantee future success.  To the
extent that cluster design is rational and not top 500 marketing hype,
precisely the issues you raise are relevant to the would-be cluster
owner, and there are real metrics that CAN be derived for accepted tools
to provide the answers.

Complex yes -- unavoidably so for complex network-parallel applications.
However, complexity isn't necessarily a mess, and even a mess can be
straightened out and organized according to certain principles.

I'm not looking for the moon, here, BTW.  I just think that we could do
better than ranking everything according to Linpack (only).  In fact, I
think it would be difficult, literally, to do any worse!  I mean, how
about aggregate bogomips?  I suppose that would be worse.  Maybe.

> > I mean, this site is "sponsored" by some presumably serious computer
> > science and research groups (although you'd never know it to look at all
> 
> The Top500 team is 4 people, with the same guy doing most of the work 
> since the list was created. There is no "groups" behind it.

I was referring to the three institutions listed in the upper left hand
corner of the site (as toplevel "groups").  Also, everybody who works
for a University or government lab typically works for a "group" with
some specified funding stream, and Dongarra, at least, is a serious
computer scientist whose work I (generally) greatly admire.

Maybe they haven't been able to get e.g. NSF to fund it directly.  Maybe
they actually make money from it and use running the list to fund other
stuff.  I don't know, I just find streaming ads annoying.  Would we all
still use this (the beowulf) list if vendors hammered us with visual ads
and sales every time a technical issue was raised as a topic of
discussion?  I very much doubt it.  Yet the top 500 list IS clearly very
influential in purchase decisions -- so much so that vendors go to
enormous lengths and great expense to ensure that they are represented
there if at all possible and to maximize their representation there when
it IS possible.

> > sponsoring institutions are listed).  If they want to do us a real
> > public service, they could do some actual computer science and see if
> > they couldn't come up with some measure a bit richer than just R_max and
> > R_peak....
> 
> BTW, the 4 people aforementioned know quite a bit of computer science...
> The fact is that the Top500 is a statistical study, not a public 
> service. You can use it as it is or not, but it does not really matter.

As I said, I'm more than aware that they are competent CS people.  I'm
also highly aware of statistical studies and the value thereof
(statistics is one of my many hobbies:-), and if the top500 is a
statistical study, it is in my opinion a very narrow one.  It requires
voluntary participation (self-selection) and excludes by design a vast
range of high performance computing operations to further narrow its
range.  These distortions bias the results, although yes the bias is
uniformly applied and creates a self-consistent landscape.

You might as well use Mensa members (only) as the basis of meaningful
statistical measures of intelligence in the general populace.  Try not
to think too hard about all the brilliant people who don't belong to
Mensa, including all the people who may not even test out that well in
I.Q. but somehow manage to win Nobel prizes anyway.  Also try not to
think too hard about the kind of person that would WANT to belong to
Mensa, and why.  Not to particularly pick on another "top 500"-like
club... I'm illustrating a point in statistical surveys with
self-selection as opposed to unbiased sampling.

However, the value of statistics is generally the correlations that it
establishes, even though (yes) correlation is not causality and YMMV.
Most of what my rant is concerned with is that the correlations that CAN
be established from the top 500 list are so shallow as to be (by your
own admission) nearly useless to real cluster design or informed
purchase decisions, however much they ARE very useful to vendors seeking
an oversimplified bazaar to sell their products.

Still, your point is well taken -- list usage is a voluntary matter, and
it can be taken for whatever it is worth even if it isn't worth all that
much.  Even a survey of the coffee-drinking preferences of Mensa members
might mean something useful to coffee marketers, even if it was
something much narrower than what they might have hoped for.

> > AMD has more or less "owned" the price/performance sweet spot for the
> > last two years.  If you have LIMITED money to spend and want to get the
> 
> What is the "price/performance sweet spot" ?
> How do you measure it ? How do you know that AMD owned it ?

For most consumers the bottom line is, quite correctly, how fast
reliable hardware runs their application for the least cost.

The sweet spot is thus typically where numerical performance by a
variety of metrics (e.g. sure, linpack, but also SpecFP, SpecInt, and
especially user's own applications) is the greatest for the least money
spent, per node, compared to equivalently equipped hardware from other
vendors.  And yes, this is approximate and yes, I can fancy it up as
much as you like but we'd still both know what I mean.

To measure it, one ULTIMATELY runs one's own application.
Alternatively, especially during the cluster design process, one
considers a variety of metrics (many of which are mentioned above) with
some known or expected relevance to one's own application, in a hardware
configuration capable of providing performance on a par across the
alternatives being considered.  For CPU/motherboard/memory choices,
that would typically mean equivalent networking and disk.  Numerical
performance, in turn,
is often influenced strongly by memory bandwidth for certain
applications, by processor design (e.g. pipelining etc) and compiler and
OS support.

So run your application on it if you can, run some benchmarks that tend
to have performance variability that is related to how your
application's performance varies (if you are lucky and know of some),
then compare prices from vendors that provide all of what you need
outside of the primary performance predictor space.  I know you know all
of this and work with it (for networking hardware) for a living -- it is
just that yes, there is a sweet spot, both for "each" cluster
purchaser/user, and an aggregate/average sweet spot for all cluster
purchasers/users.
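
In code, the sweet-spot arithmetic boils down to something like the
following sketch (the weights and numbers are invented, of course;
yours would come from YOUR application and YOUR quotes):

  # Sketch of the sweet spot arithmetic: performance on the metrics that
  # matter to YOUR application, weighted accordingly, divided by price.
  # All numbers and weights below are invented placeholders.
  candidates = {
      "vendor_A_node": {"price": 2400.0,
                        "metrics": {"my_app_rate": 1.00, "stream_triad": 1.00}},
      "vendor_B_node": {"price": 3100.0,
                        "metrics": {"my_app_rate": 1.15, "stream_triad": 1.40}},
  }
  weights = {"my_app_rate": 0.8, "stream_triad": 0.2}   # my bottlenecks, not yours

  for name, c in candidates.items():
      perf = sum(w * c["metrics"][k] for k, w in weights.items())
      print("%-14s relative perf %.3f   perf per $1000 %.3f"
            % (name, perf, 1000.0 * perf / c["price"]))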

As for how I know it, from my own tests (limited as they may be) and
other purely anecdotal experience, of course.  When thinking of a
cluster design for a new cluster, I run lmbench, cpu_rate, my own
application ;-), the applications of many of my good friends who use
clusters for a living, look up spec results and might even look up
linpack results for Opterons or Athlons before them at a variety of
clock and memory configurations.  I look at the price of equivalently
equipped and supported CPUs in a given class across the range of clocks
supported by each to the extent that patience permits.  While there are
doubtless exceptions (although I have yet to find one) I've just gone
through such a process and the Opterons at this particular moment in
time (In My Opinion and as of Some Weeks Ago, things change rapidly)
deliver the most general purpose "numerical" performance per dollar
spent, in reliable and consistent hardware configurations from a variety
of vendors, end of story.  The Athlons and Durons were actually narrower
winners in their day, but in the case of the Opteron it really has been
little to no contest, EVEN with the 2.4 kernels wasting a goodly chunk
of the potential of the hardware.  With 2.6 kernels they simply fly.

Not terribly surprising, actually.  The technical and performance
advantage of the 64 bits is at least partly offset by Intel's commanding
market advantage, causing a relative price inversion so that AMD's best
is priced better than Intel's best.  Number 2 has to try harder.

Now, if you object (perfectly reasonably) that THIS isn't an objective,
reproducible, statistically supported conclusion but instead is
anecdotal bullshit that might or might not be true, you'd be perfectly
correct (not that it would change my opinion:-).  I'd counter that this
is precisely why having a MEANINGFUL statistical sampling of MANY
performance measures on real production clusters at many scales and
purposes would be so very useful, and why the top 500 list is so
-- disappointing -- in contrast.

> The Top500 list tells you that people put their money on Intel more than 
> on AMD. If you assume that people do they homework when they buy such 
> machines, you may deduct that Intel was more attractive in the bidding 
> process.

There are so many flaws with this observation that I won't even bother
addressing them.  Next you'll be arguing that Windows is a technically
superior operating system because Look!  Countless Surveys Prove that
more systems are sold running Windows than all of the alternatives put
together!

Not to mention the self-fulfilling prophecy aspect of the top 500
feeding upon itself.  Perhaps the reason there are so many Intel systems
represented is because there have been so many Intel systems represented
in the past, and because people really ARE strongly influenced in their
purchase decisions by the top 500 list itself.  Perhaps it is because
Intel twists the arms of many of the potential vendors of systems at
that scale to use only their processor in everything that they sell or
pay more for everything (shrinking their margins) just like Microsoft
does with Windows.  Perhaps it is because they charge such a high margin
compared to the actual performance that they can afford to discount
cluster components heavily to specific "big" purchasers and still make
money while giving the consumer the illusion that they are "getting a
bargain".  Perhaps it is the eternal "TCO" chant of marketers everywhere
seeking to justify their high-margin sales (sometimes with reason, mind
you -- I'm just observing that this particular mantra plays to the
advantage of the top player in any given market).

I mean, c'mon.  Even the best of humans have a bit of sheep within them.

> I can understand how the Top500 is not matching your expectations, but 
> the truth is that it was designed to match them. What you are looking 
> for just does not exist. It's not utopia, you could really imagine a 
> company following the SPEC model: vendors pay for submitting (a lot of) 
> results and results are reviewed seriously. All you need is to convince 
> vendors to spend a lot of money on it. Is the HPC market big enough for 
> that ? I doubt it.

I agree that the top 500 model is not matching my expectations, and I
think that (deep down) we agree as to the groups for whom it IS matching
expectations.  However, we aren't limited to just the Spec model. We
could use a different model altogether, one that we create and that need
not cost a lot of money on anybody's part to play.

I have great hopes for Doug's efforts to build a suite of cluster
benchmarking and testing tools. Another group that CAN play the role of
watchdog in this sort of thing is the news media -- a magazine, for
example.  Hmmm, the magazine I write for, for example.  Even with
magazines one has to be a bit careful -- they rely on vendors for
advertising dollars -- but generally to my own experience they manage to
present a fair and balanced view of performance analyses and tests.
Magazines have most definitely developed their own suites of testing
programs many times in the past and published regular comparative
reviews of the available hardware on that basis.  And of course you can
trust ME implicitly...;-)

So two (of many) possible funding/support models would be for CMW (for
example) or an NSF-supported site (for another example) to set up a
"rich" database like the one I describe above, and to provide a
prepackaged, ready to build and run suite of open source benchmarks to
support a unified and consistent measure of performance.  The cluster
community appears to be big enough to support its own magazine and its
own targeted advertising -- maybe it has grown to where it CAN support a
real database of real performance metrics across a broad base of cluster
owners.

Note that I say just "build and run" the benchmark suite, not "tune" it,
at least at the software level.  Ideally, this would be a matter of e.g.
rpmbuild or ./configure; make (possibly after editing a single toplevel
configuration file to select e.g. compiler and network and disk
resources to be tested and to specify the test cluster).

The tool itself could even be made autoreporting -- run it and it
refreshes your password- or cookie-protected entry in the master table.
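
The reporting half could be as simple as the following sketch (the URL,
the token scheme and the result format are all invented for
illustration; the real suite would obviously define its own):

  # Sketch of an autoreporting client: run the packaged suite, then push
  # the results to the registry.  The URL, the token and the result format
  # are hypothetical.
  import json
  import urllib.request

  REGISTRY_URL = "https://registry.example.org/api/report"   # hypothetical
  ACCESS_TOKEN = "my-cluster-token"                           # hypothetical

  def run_suite():
      # In reality this would invoke the benchmark suite and parse its
      # output; here we just return placeholder numbers.
      return {"stream_triad_MBps": 2400.0, "tcp_lat_us": 60.0}

  report = {"cluster": "my-cluster", "benchmarks": run_suite()}
  req = urllib.request.Request(
      REGISTRY_URL,
      data=json.dumps(report).encode(),
      headers={"Content-Type": "application/json",
               "Authorization": "Bearer " + ACCESS_TOKEN})
  urllib.request.urlopen(req)   # refreshes this cluster's entry in the master table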

So I think it is doable, and doable for very little more effort than
that being spent already on the top 500 list.  I also think that it is a
WORTHWHILE, DESIRABLE, GOOD thing to do.

Perhaps it will happen.

Now, if I could only repackage the above and turn it into a CMW column,
Life would be Good.  Sigh.

Heck, maybe I will.  Then I'd get paid for writing it...;-)

</rant>

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu




