[Beowulf] 512 nodes Myrinet cluster Challenges
Robert G. Brown
rgb at phy.duke.edu
Wed May 3 13:24:09 PDT 2006
On Wed, 3 May 2006, Vincent Diepeveen wrote:
> Let me react to the following small quote:
>> You are somehow convinced that institutions buying clusters are brain
>> dead and always get ripped off. Some are, but most are not. You don't
>> have all of the information used in their decision process, so you draw
>> invalid conclusions.
> Ok a few points:
> First a simplistic fact:
> 0) the biggest and vast majority of supercomputers get bought by
> (semi-)government organisations
What does this mean? Are Universities semi-government organizations?
How about IBM, Cray, Hitachi? How about oil companies? And just what
is "a supercomputer"? In the old days it meant a big iron system; now
it more often means an HPC cluster of some sort or another that is "a
computer" in the beowulfish sense more by virtue of naming than of
function per se as its capabilities may or may not be brought to bear on
a single problem for any significant fraction of its duty cycle.
I ordinarily have little use for the top500 site, but eyeballing it one
might possibly conclude that the biggest supercomputers do indeed get
bought by the government (as the BIGGEST ones cost so much that nobody
else can afford them). So the top 100 is dominated by government, by
Universities (purchased, doubtless, with government money if that makes
them "semi-government") and big computer companies.
From there on down your statement is just plain wrong. Corporations
appear to be in the slight majority in at least the second 100 and
beyond, followed by Universities and with the government third, although
this is based on eyeball counts because I'm too lazy to do better.
Somewhere on the top500 site or a derived site there is probably a pie
chart with all of this information predigested, though, for googlers out
there. Lots of oil companies, banks, semiconductor companies,
telecommunications companies, engineering firms, car companies,
aerospace companies. Pretty much who you'd expect to be interested in
large scale cluster computers on a commercial basis, actually -- and who
can afford them.
As always, the top500 presents a very skewed view of the supercomputing
world, as I >>personally<< consider a 32 node beowulf cluster running a
single very fine-grained synchronous task that peaks in its parallel
scaling speedup for the IPC channel used at 32 nodes to be "a
supercomputer". A well-designed, cost-benefit optimal one at that. A
32 node generic grid-style cluster is arguably "a supercomputer" in that
it significantly improves on doing von Neumann-style serial computing on
a single processor. Who >>knows<< what the distribution of ownership is
of 16, 32, 128 node clusters (things below the top500 radar) or
corporate computers that they didn't bother to register on the top500
site (perhaps because running their benchmark is, really, a waste of
corporate resources and time unless the vendor does it free, which is
what likely DID happen for many of the entries).
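The speedup peak mentioned above can be made concrete with a toy scaling model: if per-step communication cost grows with node count while compute shrinks as 1/N, parallel speedup tops out at a finite N. The constants below are purely illustrative assumptions, not measurements from any real cluster:

```python
# Toy scaling model (illustrative constants, not measurements): compute
# time shrinks as 1/n, but synchronous communication cost grows with n,
# so speedup peaks at a finite node count and then falls off.

def speedup(n, t_serial=1.0, t_comm_per_node=0.001):
    """Speedup of a fine-grained synchronous task on n nodes."""
    t_parallel = t_serial / n + t_comm_per_node * n
    return t_serial / t_parallel

# Find the node count where speedup peaks for these assumed constants.
best_n = max(range(1, 129), key=speedup)
print(best_n, round(speedup(best_n), 1))  # peaks at 32 nodes, ~15.8x
```

With these (made-up) constants the optimum lands at 32 nodes; adding nodes past the peak makes the run slower, which is exactly why such a cluster is "well-designed" at its peak size and overbuilt beyond it.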
> That in itself should already raise question marks in our heads. Even though
> some companies need massive
> power, they somehow seem to be smarter and manage to do the job in a
> different way.
Do >>what<< job in a different way? WHAT different way? If you want to
do large scale HPC-generic computations, there aren't really a hell of a
lot of options. It is "do them on a cluster", or perhaps "do them on a
cluster". And then there are those that just "do them on a cluster". A
VERY FEW of the clusters are big-iron-y clusters with custom IPC
channels. The "biggest and vast majority" of them are just plain old
beowulf-style or grid-style HPC compute clusters made up of a large pile
of "independent" motherboards, memory, and processors interconnected by
a bus-linked network. The network is minimally 100BT (or more
reasonably by this point 1000BT) but goes up to and includes Myrinet,
Infiniband, Quadrics, etc. at the high end, where high end means:
a) Highest bandwidth
b) Lowest latency.
The high end interconnects are expensive. The low end ones are cheap.
At the highest end, things get all techno-detailed and price/performance
depends on a lot of things including (in a very specific way) the TASK
MIX and SIZE OF THE CLUSTER and DESIGN OF THE MOTHERBOARD (especially
the cpu, bus and memory subsystems), YMMV, Caveat Engineer, it's Your
Money so Spend It Wisely.
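A first-order way to compare these interconnects is the usual message-time model t = latency + size/bandwidth. The figures below are rough, era-typical assumptions rather than vendor specs:

```python
# First-order message-time model: t = latency + size/bandwidth.
# Interconnect numbers are rough, era-typical assumptions, not specs.

def msg_time_us(size_bytes, latency_us, bandwidth_MBps):
    # bytes / (MB/s) conveniently yields microseconds
    return latency_us + size_bytes / bandwidth_MBps

nets = {
    "GigE":    (50.0, 125.0),    # ~50 us MPI latency, ~125 MB/s (assumed)
    "Myrinet": (3.0, 1200.0),    # ~3 us, ~1.2 GB/s class (assumed)
}
for name, (lat, bw) in nets.items():
    small = msg_time_us(64, lat, bw)         # latency-bound message
    large = msg_time_us(1_000_000, lat, bw)  # bandwidth-bound message
    print(f"{name}: 64 B -> {small:.1f} us, 1 MB -> {large:.0f} us")
```

The point of the sketch: the expensive network wins by over an order of magnitude on tiny messages and much less dramatically on big ones, so whether it is worth its price depends entirely on the task mix, as the paragraph above says.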
So what exactly did you mean about "a different way"? Something
different from piling up CPUs and memory and interconnecting them with
an OTC network? Does that belong on this list?
> However i can now only talk and speak about how government works, as all
> those machines belong to the government.
> There is now suddenly a great distinction between university level and
> government level.
> SOME universities do pretty well actually and the points underneath do not
> apply to them
> a) Let me give a simple example. In the supercomputer europe 2003 report
> they still seem to not know that
> opteron processors support SSE/SSE2, whereas, i'm guessing this report
> was used to order for 2004/2005
> As a result of that intel processor had an advantage over amd, whereas
> that AMD processor in SSE2 already
> was performing in tests for me in DFT faster than any intel processor.
> Even though i reported
> this to the persons in question and the organisation, i never got an answer.
> When i read that report 2003 which was available in start 2004, i
> already knew what kind of machine
> they were gonna order for 2005.
From which we can conclude that some people are stupid, and others are
greedy? That's no surprise, and what's your point (beyond that)? For
the most part, even government machines (at least in the US) are
designed one at a time for a specific task, are grant funded for
performing that task, are purchased and assembled and perform that task
and any others that might come up in the venue where they operate.
There is no such thing as a "supercomputer report" that recommends Intel
vs AMD and plenty of knowledgeable people on this list and elsewhere
have selected Opterons for years now.
The biggest problem with Opterons three years ago was that operating
system/distro and compiler support sucked (in Linux, at least), not the
chips. But as of FC2 and beyond (in linux distro terms) they most
definitely did not suck, and nobody not stupid or greedy (and getting
some sort of a goose from Intel, or suspicious of any non-Intel
solutions for stupid reasons) would have written off Opterons from that
point on.
> b) hardware moves fast. What is fast this year, is slow next year. A simple
> indexation *previous* year what
> was available that year and only use *that* to order a machine for
> *next* year, is slow moving.
> Companies in general are faster there.
Vincent, companies are NEVER faster in the cluster world. They cannot
afford to be, literally, for one thing (with the possible exception of
e.g. IBM etc that MAKE the computers as well as build the clusters).
Beowulf-style clusters were "invented", if that is the right term for a
distributed collaborative event that spanned dozens of venues, between
roughly 1989 and 1996, by a mix of University and government supported
workers in the fields of computer science and physics (mostly). I
personally credit Dongarra's PVM team with the invention, since before
PVM commodity clusters were pretty much rsh-based workstation compute
farms (nothing to sneeze at, BTW, and I was right there using them in
precisely that way).
From PVM on, clusters have ALWAYS been done fastest and best by
governments and Universities, or computer companies drawing on the brains
and soft technologies developed at government and University centers.
In the process the successful design (clusters of independent COTSish
systems interconnected with COTS IPC channels ie networks) has
absolutely wiped out the "corporate" best effort of the day, which was
almost invariably a big iron supercomputer design (even if that design
was a cluster in disguise). Today only a handful of special purpose
vector-type machines still survive AFAIK, and a whole lot of list time
has been devoted to making clusters out of vector processors (e.g. DSPs)
of one sort or another.
Companies only copied the successful models developed on this list
(among other places, but a WHOLE LOT was done by participants on this
list) when it was "safe" to do so, when the engineering was all cut and
dried (and there were enough people trained to do it), and when in
particular the cost/benefit was so painfully obvious that they had no
choice.
> c) if you learn what price gets paid for the hardware effectively, you will
> fall off your chair. The so called public
> costs that governments sometimes tell publicly simply aren't true what
> they *effectively-all-in* pay.
> A few good guys excepted, but in general it's simply not their money, so they
> don't care.
> Sometimes there are good reasons for this. So it's not like that the persons
> having the
> above actions are bad guys. It's easier to get 20 million euro from government
> sometimes than to get 1 million.
Sigh. There is probably some truth to this, unfortunately, because of
the nonlinear memetic-evolutionary advantage accrued to "status" and
"visibility" in our collective/democratic world. We don't have any
all-wise and all-knowing masters telling "us" the best things to fund,
the best prices to pay, so the decisions get made by people on the basis
of what they know and a collective decision-making process. So what DO they
know? What they've heard of, what has made an impact, a splash. It is
all about marketing.
A secondary point is that the government often has an ulterior motive in
funding big apparently pointless computers for grand-challenge type
projects. It supports the cash flow and R&D of the companies that bring
us each new generation of miracles. Some of the largesse spills over
into much more humble programs. The large-scale purchases push
manufacturing out to scales where prices collectively fall. It keeps
the countries to which those companies belong competitive on the global
market. Some of this is decided fairly deliberately in HPC, just as it
is in e.g. aerospace industries, defense industries -- "strategic
resource" companies get big contracts to ensure that there continue to
be strategic resource companies that can FULFILL big contracts, and that
our country's companies can build bigger faster things than your
country's companies. This is all geopolitick stuff.
Both of these things of course appear in many other venues, not just
computing. Government funded research projects, high energy physics,
and so on -- this is known as "politics" and it makes the world go round
and is a GOOD thing, as I for one would really, really not want the
world to be run by masters, not EVEN if they really were all-wise and
all-knowing.
> d) It is definitely the case that i do not see 'bad' persons on those spots,
> it's just that the average government official knows nothing from
> contracts. They have doctor and professor titles,
> because one day they were good somewhere in some domain, and perhaps
> even are today; they aren't on that
> spot because they are good in contracts. Government jobs pay based upon
> number of titles you've got,
> they don't pay based upon how good you are in closing deals and closing
> good contracts. They sit there because
> they are good in working with commissions which are good for your
> health; such commissions usually have nice meetings,
> unlike companies where meetings are usually way harder as tough
> decisions get forced there.
> Find me 1 bureaucrat on the entire planet who is good in contracts and sits at
> the right spot to decide what happens.
Oh, you're just being cynical. Which is your god-given right, of
course. However, there are good people in government, bad people in
government, in between people... same as anywhere else, really.
Politics consists of two people with different tastes trying to decide
what movie to watch together, for God's sake, and scales up from there.
It beats resolving the issue with clubs.
> I saw this in politics even more than in HPC. Those sectors
> go about a lot more money than several tens of millions. Highend
> really is about little money if you compare it with other infrastructural
> projects. In telecommunication
> i remember a meeting i had with a manager of a telecommunication company who
> nearly wanted to
> beat me up when en passant i remarked to him that such a wireless station of
> around 60 watt i would never
> want in my garden, as 60 watt straight through my head i do not consider as a
> healthy way of life.
> Even if they would offer me 20k euro a year for it, like they do to farmers
> for example.
Sigh. 60 watts is the power given off by a -- wait for it -- 60 watt
light bulb. Your head (brain) consumes roughly 1/3 of the
calories/watts your body burns -- call it 30 watts total (out of a
fairly typical 100W total output from a human body). A 60 watt light
bulb doesn't exactly keep you warm on a cold winter day from a single
meter away. Putting your head right on it, sure -- get 20 watts hitting
it from a few centimeters away and it will get warm, maybe dangerously
warm if you sit in a beam focus and hold still.
This is the shorthand reason that this is silly. If you want the long
version, I have to talk about skin-depth, frequencies, and so on. The
long and short of it is that the power differential associated with that
60W transmitter some meters away inside the tissue of your skin, let
alone your brain, is SO LOW compared to what it experiences from the SUN
while WORKING in your garden that you might as well decide to avoid all
sources of power (including your computers) for the rest of your life as
it might damage your health being warm.
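The back-of-envelope behind this is just the inverse square law. Assuming an isotropic 60 W radiator and a rough 1000 W/m^2 clear-sky solar irradiance (both round-number assumptions):

```python
# Inverse-square sanity check (round-number assumptions throughout):
# flux from an isotropic 60 W transmitter at ~30 m, versus rough
# clear-sky solar irradiance at the surface.
import math

def flux_W_per_m2(power_W, r_m):
    """Isotropic radiator: power spread over a sphere of radius r."""
    return power_W / (4 * math.pi * r_m ** 2)

tower = flux_W_per_m2(60, 30)   # 60 W mast roughly 30 m away
sun = 1000.0                    # rough clear-sky solar irradiance, W/m^2
print(f"tower: {tower:.4f} W/m^2, sun/tower ratio: {sun / tower:.0f}x")
```

At 30 m the flux comes out to a few milliwatts per square meter, roughly five orders of magnitude below what the sun delivers to someone standing in that same garden.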
Now Mr. Sun is, as a matter of fact, dangerous as all hell. Expose your
skin to HIM and you'll get radiation poisoning in short order. That's
because he gives off dangerous amounts of UV, not microwaves, and UV has
a short enough wavelength that it can affect things in a quantum way
instead of by gross absorption in the form of heat.
Directional microwaves are not without their dangers as well. 60W can
cook a hot dog if they've got no place else to go until they are
absorbed. But omnidirectional or dipole-pattern 60W from tens of meters
away just isn't an issue.
(Which reminds me of the equally silly claim that living near a HV power
line caused cancer, in spite of computations that showed that at MOST
milliwatts of very low frequency radiation were being emitted. Maybe
from ozone or something but damn sure not from EM radiation).
> Compared to that HPC is near holy. It is not bad for health.
> It's good for health in fact as it avoids for example atomic test blasts and
> at a very cheap price too.
Oh, you mean that it enables every small country in the world to design
implosion bombs without any need to build 25,000 pits and blow them up
one at a time to develop an effective implosion design (as was done by
the US in pre-HPC WWII)? Or design thermonuclear devices, enhanced
radiation devices, and the like without first having to build an actual
implosion trigger? So that if/when they DO manage to get a reactor
going that can cook up some plutonium it is easy to build bombs "without
testing"?
Yeah, very healthy ;-)
Of course, if you can extract U-235 it doesn't matter since any idiot
can build a bomb out of U-235. Take two subcritical masses, one in each
hand, and slam them together as hard as you can. Boom. With a little
effort, you can design one that will go off when you aren't actually
there.
> So in comparison to telecommunication and high voltage power and gas pipes
> where you see sometimes
> complete idiocy which is near to or equal to corruption and mixed interests;
> i'm 100% sure this isn't the case in HPC.
> The talk i had with next director looks worse than his actions
> are in fact. He's a good guy and really did do his best, but next
> conversation is typical for government.
> We were standing at a certain big supercluster. 1 meter away from it.
> In the huge sound it produced he said: "we bought this giant machine for a
> lot of money
> and i was promised at the time that we could simply
> upgrade it a few years later by putting in a new processor.
> Now we will already have to take it out of production by 2006,
> with by then completely outdated processors. Just upgrading those cpu's
> would still make it a powerful machine. New dual core 700Mhz processors are
> expected to
> arrive by 2005 or even 2006 and will not be socket compatible. So we got
> tricked and in
> fact lost a lot of money by buying this machine as it can't get upgraded."
> Me: "But you signed a contract which ensures you can sue them now?"
This is a common enough mistake, actually, but it isn't the mistake that
you think it is. The real mistake is in thinking that you can EVER
economically upgrade a computer system at the processor level. The
answer is ALMOST always no as little as one year later and beyond.
I made it myself around 1990 (purchasing a 2-processor SGI 220S thinking
that we could upgrade it to more processors and memory), at their
"bargain" price of some $20,000 per processor pair. When I learned,
three years later, that my $5000 desktop workstation was faster than
both processors put together (that had cost around $60,000 or
thereabouts in a refrigerator-sized unit only three years before) it was
a crash course in Moore's Law and its implications in purchasing and
designing supercomputers -- 1993 was also when I started using PVM,
through no real coincidence (I'd already gotten computations up to
twenty-odd processors the rsh-hard way over 10B2 ethernet). We sold
this machine when the annual software maintenance on its operating
system cost more than a workstation that was twice as fast -- for $3000
less shipping, to somebody who needed it because they ran software that
used its graphics engine.
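Running the anecdote's own numbers gives the implied price/performance doubling time, assuming (conservatively) that the $5000 desktop merely matched the old $60,000 machine:

```python
# Back-of-envelope on the anecdote's own numbers: a $5,000 workstation
# in 1993 at least matched a $60,000 machine from 1990. What
# price/performance doubling time does that imply?
import math

ratio = 60_000 / 5_000            # >= 12x price/performance improvement
years = 3
doubling_months = years * 12 / math.log2(ratio)
print(round(doubling_months, 1))  # ~10 months
```

A doubling time of well under the canonical 18 months is why processor-level upgrades of COTS systems almost never pay: by the time the "upgrade" parts ship, a whole new commodity box at the same price beats the upgraded machine.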
There are lots of ways to AVOID making this mistake, of course. Dealing
with an honest vendor is one, as an honest vendor would never assert
that ANY COTS could be economically upgraded at the processor level two
to three years later. Hiring a consultant who isn't a complete idiot is
another (there are so MANY people on this list who are competent in this
regard). Hell, just asking the list and getting a
free-as-in-you'd-owe-me-a-beer consult from me or any of a few dozen
others on the list would have done it.
Lacking somebody with this sort of experience or knowledge helping out,
though, and using a somewhat sly vendor, yeah, well, you pays your dues
one way or another to learn. But what, again, does this have to do with
government vs corporate vs University? If it is a sin, well hey, I'm a
sinner too -- they sound so >>convincing<< when they tell you it is
upgradable, after all. And sometimes it isn't even a "lie" -- it is
just a mistake of the vendor or their sales rep, who may have THOUGHT it
would be upgradable, been TOLD it would be upgradable. Things change,
people get them wrong sometimes. Even experience isn't a totally safe
guide -- perhaps there ARE systems out there that somebody has managed
to upgrade at the processor level without gritting their teeth and
ultimately wasting a lot of money.
> [ear deafening silence]
> Best regards,
> Director DiepSoft
> p.s. certain people in HPC only seem to react when they feel attacked or
>>> They either give the job to a friend of them (because of some weird demand
>>> that just 1 manufacturer can provide),
>>> or they have an open bid and if you can bid with a network that's $800 a
>>> port, then that bid is gonna get taken over
>>> a bid that's $1500 a port.
>> The key is to set the right requirements in your RFP. Naive RFPs would
>> use broken benchmarks like HPCC. Smart RFPs would require benchmarking
>> real application cores under reliability and performance constraints.
>> It's not that "you get what you pay for", it's "you get what you ask for
>> at the best price".
>>> This where the network is one of the important choices to make for a
>>> supercomputer. I'd argue nowadays, because
>>> the cpu's get so fast compared to latencies over networks, it's THE most
>>> important choice.
>> In the vast majority of applications in production today, I would argue
>> that it's not. Why ? Because only a subset of codes have enough
>> communications to justify a 10x increase in network cost compared to
>> basic Gigabit Ethernet. Your application is very fine grain, because it
>> does not compute much, but chess is not representative of HPC workloads...
>>> My government bought 600+ node network with infiniband and and and....
>>> dual P4 Xeons.
>> again, you don't know the whole story: you don't know the deal they got
>> on the chips, you don't know if their applications runs fast enough on
>> Xeons, you don't know if they could not have the same support service on
>> Opteron (Dell does not sell AMD for example).
>> By the way, your government is also buying Myrinet ;-)
>>> I believe personal in measuring at full system load.
>> Ok, you want to buy a 1024 nodes cluster. How do you measure at full
>> system load ? You ask to benchmark another 1024 nodes cluster ? You
>> can't, no vendor has such a cluster ready for evaluation. Even if they
>> had one, things change so quickly in HPC, it would be obsolete very
>> quickly from a sale point of view.
>> The only way is to benchmark something smaller (256 nodes) and define
>> performance requirements at 1024 nodes. If the winning bid does not
>> match the acceptance criteria, you refuse the machine or you negotiate a
>> "punitive package".
>>> The myri networks i ran on were not so good. When i asked the same big
>>> blue guy the answer was:
>>> "yes on paper it is good nah? However that's without the overhead that
>>> you practically have from the network
>>> and other users".
>> Which machine, which NICs, which software ? We have 4 generations of
>> products with 2 different software interfaces, and it's all called
>> Myrinet.
>> On *all* switched networks, there is a time when you share links with
>> other communications, unless you are on the same crossbar. Some sites do
>> care about process mapping (maximize job on same crossbar or same
>> switch), some don't. From the IBM guy's comment, I guess he doesn't know
>> much about it.
>>> A network is just as good as its weakest link. With many users there is
>>> always a user that hits that weak link.
>> There is no "weak" link in modern network fabrics, but there is
>> contention. Contention is hard to manage, but there is no real way
>> around except having a full crossbar like the Earth Simulator. Clos
>> topologies (Myrinet, IB) have contention, Torus topologies (Red Storm,
>> Blue Gene) have contention, that's life. If you don't understand it, you
>> will say the network is no good.
>>> That said, i'm sure some highend Myri component will work fine too.
>>> This is the problem with *several* manufacturers basically.
>>> They usually have 1 superior switch that's nearly unaffordable, or just
>>> used for testers,
>>> and in reality they deliver a different switch/router which sucks, to
>>> put it politely.
>>> This said without accusing any manufacturer of it.
>>> But they do it all.
>> Not often in HPC. The HPC market is so small and so low-volume, you
>> cannot take the risk to alienate customers like that, they won't come
>> back. If they don't come back, you run out of business.
>> Furthermore, the customer accepts delivery of a machine on-site, testing
>> the real thing. If it does not match the RFP requirements, they can
>> refuse it and someone will lose a lot of money. It has happened many
>> times. It's not like the IT business when you buy something based on
>> third-party reviews and/or on specsheet. Some do that in HPC, and they
>> get what they deserve, but believe me, most don't.
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu