[Beowulf] Re: vectors vs. loops
diep at xs4all.nl
Wed May 4 07:14:42 PDT 2005
At 09:37 AM 5/4/2005 -0400, Robert G. Brown wrote:
>On Wed, 4 May 2005, Vincent Diepeveen wrote:
>> At 07:18 PM 5/3/2005 -0700, Greg Lindahl wrote:
>> >On Wed, May 04, 2005 at 02:40:17AM +0200, Vincent Diepeveen wrote:
>> >> In fact all those complaints about memory bandwidth IMHO show a lack of
>> >> understanding of how hardware works.
>> >Allow me to be the first to congratulate you for being so willing to
>> >spend your valuable time to educate us all, and in fact the entire HPC
>> >-- greg
>> If an itanium2 montecito dual core clocks 2ghz and can therefore
>> theoretically deliver roughly 2 * 2 * 4 = 16 gflops, it's undeniable that this
>> means that cell starts at a potential of at least 16 * 8 = 128 gflops.
>> With quite a few more execution units, IBM in fact claims 256-320 gflops.
>> Personally i believe in progress and that every few years faster cpu's come
>> on the market. Not trusting the companies to produce faster processors is
>> completely illogical.
>> Denying the fact that a cell processor starts at a factor 8 advantage in
>> number of gflops, simply because it is clocked nearly 2 times higher and has
>> 4 times more cores, is very unsportsmanlike.
>> What matters is what happens at the chip. Just like network bandwidth
>> doesn't stop today's matrix calculations from using the full potential of
>> today's chips, memory bandwidth obviously won't stop a CELL processor
>Vincent, I think that what Greg is TRYING to say here in his typical
>understated fashion is that (in the words of Inigo Montoya in "The
>Princess Bride") "I do not think that this word means what you think it
>means" when you use the word "bandwidth". You assert that memory
>latency and bandwidth are not critical bottlenecks in modern system
>design and that solving this bottleneck in system design is "easy"
>(within the capabilities of a very mediocre engineer, I think were your
>This is simply not the case. Not even close. For example, see the real
>which presents the ratio between peak MFLOPS and peak memory bandwidth
>on the world's most expensive (and fastest) hardware. Note that ALL of
>these systems are bottlenecked by bandwidth, in many cases by a factor
>of 20 or more. The ones with the smallest bottlenecks are, um,
>"expensive" as one might expect.
Those bottlenecks arise because they use 1000s of chickens.
When the cell idea was revealed i also didn't give it much attention, until
it was reviewed by a few big leading hardware experts.
Do not forget at which company they work when reading their articles.
If an intel guy says about a new amd processor: "i have doubts", that means:
"it is so good that i couldn't find any weakness to post here, otherwise i
would have mentioned my doubts".
If an ibm guy says about an intel design: "how are they going to produce it
That basically tells us that no weak spot can be found in the design.
In fact i realized very well that opteron would kick butt when i followed a
discussion of a few intel engineers versus a few amd engineers.
The comment of the intel engineers was: "it won't clock above 2.6Ghz"
The comment of the amd folks: "we will make 2.8Ghz".
At that moment i realized that, whether it would be 2.6Ghz or 2.8Ghz, it
would kick MAJOR butt, because ipc times clockrate gives a huge performance.
Please note that they reached 2.6Ghz, not 2.8Ghz.
So let's now address the memory problem some say the cell processor has. Of
course i'll give the word to an engineer there. Just read his 2 reviews :
part 2, especially about the memory connection to the chip:
Please realize how pro-intel that entire site is when reading reviews there.
"However, as mentioned previously, the success and failure of the CELL
processor to extend outside the domain of the game console depends not on
the capability of the hardware to produce more FLOPS, but instead on the
yet unknown and unproven software stack."
In short, he couldn't find any hardware problem, neither execution related
nor memory bandwidth related.
He shifts the problem to software.
That's good, that tells me the processor is excellent.
Now let's address your comments:
"Note that ALL of
these systems are bottlenecked by bandwidth, in many cases by a factor
of 20 or more. The ones with the smallest bottlenecks are, um,
"expensive" as one might expect."
Yes obviously if you cross ship a lot of data over all the routers in
clusters and supercomputers, the bottleneck is the latency and the
bandwidth the network gives.
That's why i keep on posting that it's very interesting to have a processor
which can on its own internally process 256 gflop.
You shift the bandwidth problem of the expensive network in that case to
the processor itself.
I find that a very good idea. I'm no expert on making mainboards, but from
an expert i understood that a major problem in communication speeds is that
above 500Mhz the copper wires start working as a kind of a 'transistor
So again a big reason to put things inside a chip, rather than just focus
upon making 500 mflop processors by the zillion and ordering a network for
Everything that gets done inside the chips caches means the network doesn't
Ideal is of course 1 chip that does the entire calculation for you.
That makes it a lot easier.
It's far easier to put 8 cores onto 1 chip, each core having a huge bandwidth
to the other cores, than it is to make a high bandwidth network.
>Now either solving the memory bandwidth bottleneck isn't quite as "easy"
>as you make it out to be or SGI, NEC, Cray, HP, IBM, Sun and Fujitsu all
>are hiring incompetent engineers these days, given that these are all
Money always is important.
Just compare price. I do not know many highend switches for 8 nodes that
are under $5000. An entire network set is thousands of dollars.
If you print that on 1 chip, please realize that printing a single wafer,
say 9 layers, typically costs a few thousand dollars and delivers a few
hundred chips.
So obviously the *production price* of a chip is always lower than that of
network equipment, and producing it is therefore much easier.
Nothing is as hard as a good network. Doing things inside a chip is far easier
and more convenient than doing them over a big network.
Because that's what we're talking about here. Either have 1 machine with
say a quad 4.0Ghz cell processor, delivering 1 tflop, or have a 16 node
quad montecito 2.0Ghz.
That's about 2 of the choices you have.
What will be cheaper to buy in the store?
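A quick sketch of the peak-rate arithmetic behind those two choices. The
per-chip figures are simply the ones claimed earlier in this thread (256 gflop
per cell, 16 gflop per montecito); the function name is made up for
illustration, and nothing here accounts for memory or network limits.

```python
# Theoretical peak arithmetic only; the per-chip numbers are claims quoted
# in this thread, not measurements.
CELL_GFLOPS = 256        # lower end of the figure attributed to IBM
MONTECITO_GFLOPS = 16    # theoretical peak of a 2 GHz dual-core montecito


def peak_gflops(nodes, chips_per_node, gflops_per_chip):
    """Aggregate theoretical peak; ignores memory and network bottlenecks."""
    return nodes * chips_per_node * gflops_per_chip


# One machine with four cell processors.
cell_box = peak_gflops(nodes=1, chips_per_node=4, gflops_per_chip=CELL_GFLOPS)

# Sixteen nodes with four montecitos each.
montecito_cluster = peak_gflops(
    nodes=16, chips_per_node=4, gflops_per_chip=MONTECITO_GFLOPS
)

print(cell_box, montecito_cluster)
```

On paper both configurations land at the same peak, so the comparison comes
down to price and to how much of that peak each one can sustain.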
>This table, interesting as it is, isn't as relevant to COTS designers
>If you look here at results for off-the-shelf hardware you will observe
>that without exception the CPU is at LEAST 8x faster (and only that
>small for a single modern system, the bleeding edge Opteron 848) and
>more commonly 20-50x faster than the memory.
My argument is that this huge speed difference doesn't stop anyone from
doing their matrix calculations on huge supercomputers now, and if a fast
vector processor is there, the memory won't stop the matrix calculation
from being done at the speed the processor can deliver either.
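One way to make this claim testable is a simple bound that is not from this
thread: the attainable rate is the smaller of the peak rate and memory
bandwidth times flops done per byte fetched. In the sketch below the 256 gflop
peak is the figure quoted in this thread, while the 25 GB/s bandwidth and both
flops-per-byte values are assumptions chosen only for illustration.

```python
def attainable_gflops(peak_gflops, mem_bandwidth_gb_s, flops_per_byte):
    """Upper bound on sustained rate: limited by compute or by memory traffic."""
    return min(peak_gflops, mem_bandwidth_gb_s * flops_per_byte)


PEAK = 256.0       # gflops, figure quoted in this thread
BANDWIDTH = 25.0   # GB/s, ASSUMED value for illustration only

# Streaming y = a*x + y on doubles: 2 flops per 24 bytes moved (assumption:
# nothing is reused from cache).
streaming = attainable_gflops(PEAK, BANDWIDTH, 2.0 / 24.0)

# A well-blocked matrix multiply reuses each fetched value many times;
# 16 flops per byte is an ASSUMED intensity, not a measured one.
blocked = attainable_gflops(PEAK, BANDWIDTH, 16.0)

print(streaming, blocked)
```

Under these assumptions the blocked matrix code is limited by the processor,
while the streaming kernel is limited by memory, so whether memory "stops" a
calculation depends on how much work is done per byte fetched.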
>So it's really kind of silly to claim that there is no point in being
It is not the claim that is silly; what is silly is the assumption that cell
will fail "because it has not enough memory bandwidth"
or "because it is too fast for the network"
It didn't stop matrix calculations in the past either from getting done at
optimal processor speed, the memory being tens of times slower.
So my argument still is the same: a good processor makes the difference
more than the memory.
You might want to approach it from another direction why a good processor
is so important.
If you simply study the hitrates the L1 and L2 cache get in average
software, then it's obvious that streaming from main memory is very
slow, but it is only in a very tiny fraction of the cases that you
need that main memory.
Those caches in today's processors really work very well!
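The hitrate argument can be put in numbers with the usual average access time
formula. This is a minimal sketch; every hitrate and cycle count below is an
assumed, illustrative value, not a measurement of any real processor.

```python
def average_access_cycles(l1_hit, l2_hit, l1_cycles, l2_cycles, mem_cycles):
    """Expected cycles per memory access.

    l1_hit is the L1 hitrate; l2_hit is the hitrate among accesses that
    missed L1. All latencies are in cycles.
    """
    l1_miss = 1.0 - l1_hit
    l2_miss = 1.0 - l2_hit
    return l1_cycles + l1_miss * (l2_cycles + l2_miss * mem_cycles)


# ASSUMED latencies: 3 cycles L1, 15 cycles L2, 200 cycles main memory.
cache_friendly = average_access_cycles(0.95, 0.90, 3, 15, 200)
cache_hostile = average_access_cycles(0.50, 0.50, 3, 15, 200)

print(cache_friendly, cache_hostile)
```

With high hitrates the slow main memory barely shows up in the average; when
the access pattern defeats the caches, the same memory dominates it.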
>concerned by a memory bottleneck, since a very simple, very standard
>microbenchmark tool based on instructions that are fairly common
>components of various linear transformations demonstrates that not only
>there is a bottleneck but that the bottleneck slows systems down by a
>full order of magnitude relative to what even boring off-the-shelf CPUs
>>>can<< do in terms of floating point rates when they are working on
>data already in the CPU registers.
>Putting a higher clock CPU, or a vector CPU, or more pipelines and
>fancier/faster floating point instructions on the CPU on any of these
>systems can easily be a waste of time for these classes of code.
There are very capable programmers. Show them 1 thing that is really fast on
the processor and they will rewrite the applications such that they
perform fast using that trick.
>Instead of making unfounded statements about advanced electrical
>engineering, it might be useful for you to familiarize yourself with
>tools that actually measure the VERY REAL bottlenecks that are out there
>so you can learn about code organizations and so forth that might help
>you work around them a la ATLAS. That's the real trick -- ATLAS works
>because it blocks out a linear algebra computation so that it runs out
>of cache as much as possible. It is a brilliant design, and an example
>of the significant benefits to be gained by tuning for a hardware
>architecture for some classes of code.
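To illustrate the blocking idea in the quoted paragraph, here is a minimal
sketch of a tiled matrix multiply next to the plain triple loop. It is not
ATLAS's actual code (ATLAS tunes its kernels and block sizes per machine); the
function names and the block size are made up for illustration, and in Python
it only shows the loop structure, not the speedup.

```python
def matmul_naive(a, b, n):
    """Plain triple loop over n x n matrices stored as lists of lists."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, n, block=2):
    """Same product, computed tile by tile.

    Each (ii, kk, jj) step touches only three block x block tiles, so with a
    suitable block size the working set can stay in cache while it is reused.
    """
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + block, n)):
                            c[i][j] += aik * b[k][j]
    return c
```

Both functions do the same number of multiplications; the blocked version only
changes the order in which memory is visited.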
Game tree search is among the most optimized software on the planet, my own
code is no exception there.
In fact Diep has the best scaling of any chess program on multiprocessor
hardware and the best speedup.
So who must teach who how to program?
>How much of a CPU's native speed one gets in any given computation
>depends upon the organization of that computation and the ratio of how
>much computation it does to how much data it requires, just like it does
>in a parallelized computation (where the critical bottleneck is more
>often how much data it requires that has to come over the network).
>If you do enough computation on each data object fetched from memory,
>and fetch those objects in an order that permits cache-optimizing things
>like prefetch to work to your advantage, you may see no memory
>bottleneck at all in your code. If you are spending very LITTLE time
>doing computations on each object and/or are accessing the data in an
>order that defeats cache, you might well see a huge bottleneck relative
>to the CPU's real (not theoretical) peak rates under ideal
>circumstances, e.g. operations on values already in CPU registers.
>It is really instructive to run a microbenchmark that measures random
>memory access rates compared to streaming to fully illustrate the latter
>point (and show how it depends on architecture). When I run my integer
>rw test in shuffled and unshuffled/streaming order, for example, I see a
>split in rates of a factor of 3-5 even on vectors that fit in L2 cache
>on a P4 (with shuffled slower, of course). On an AMD-64 shuffled vs
>streaming is absolutely flat in L2 (kind of impressive, actually) and
>there is only a small drop-off as one streams out of main memory.
>Random out of main memory, OTOH, drops to the floor (although it remains
>above the P4's, which drops THROUGH the floor). In L2 the performance
>is consistently faster than that of the P4 as well for either kind of
>For that reason, I'd expect certain classes of computation to perform
>much better on an AMD 64 with suitable blocking than they would on a P4.
>Not surprising, but it is useful and entertaining to be able to SEE it
>and not theorize about it.
>In the meantime, I doubt that you'll convince the many HPC enthusiasts
>on this list that there is no memory bandwidth bottleneck and it is all
>a figment of their imagination or the result of openly incompetent
>engineering. Too many of us live with code that is at least partially
>slowed by that very thing, too many consumers would cheerfully
>differentially select systems with a lower bottleneck if it were
>possible to do so at constant cost. All of us, in fact, are affected in
>all probability, given that the problem extends even to L2 (bottlenecked
>relative to on-chip registers, although remarkably less on the new
>In the meantime, we all love faster CPUs and higher clocks because cache
>DOES work for lots of classes of code and many of us (myself included)
>DO enough work per memory access that CPU clock is the primary
>bottleneck and not memory. Just remember that people's code and the
>corresponding optimal system balance tends to be different. A system
>balance in the range of 10-30 is actually just fine for a lot of folks'
>code. Spending 10-100x more for a system to reduce it to 1-3 won't
>really make things all that much faster for them, unless peak CPU
>increases at the same time, and the cost-benefit has to be carefully
>considered. The result has to be VALUABLE to justify spending that much
>extra money just to get a result a bit earlier or faster.
>> Now we must see of course whether IBM can deliver it cheaply and in time,
>> or whether other manufacturers will show up sooner with a similar design
>> that's faster and above all cheaper. That still doesn't take away that the
>> advantage a chip tens of times faster delivers is obvious when we realize
>> that far over 50% of all supercomputer system time goes to matrix type
>> calculations. All of which are embarrassingly parallel and easy
>> In fact the libraries are already vectorizing it more or less. It's just
>> doing more of the same thing faster now.
>> Denying that here is very unsportsmanlike towards all companies making
>> highend chips.
>> >Beowulf mailing list, Beowulf at beowulf.org
>> >To change your subscription (digest mode or unsubscribe) visit
>Robert G. Brown http://www.phy.duke.edu/~rgb/
>Duke University Dept. of Physics, Box 90305
>Durham, N.C. 27708-0305
>Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu