[Beowulf] Re: vectors vs. loops

Robert G. Brown rgb at phy.duke.edu
Wed May 4 06:37:51 PDT 2005


On Wed, 4 May 2005, Vincent Diepeveen wrote:

> At 07:18 PM 5/3/2005 -0700, Greg Lindahl wrote:
> >On Wed, May 04, 2005 at 02:40:17AM +0200, Vincent Diepeveen wrote:
> >
> >> In fact all that complaints about memory bandwidth IMHO is a lack of
> >> understanding how hardware works.
> >
> >Allow me to be the first to congratulate you for being so willing to
> >spend your valuable time to educate us all, and in fact the entire HPC
> >community.
> >-- greg
> 
> If an Itanium2 Montecito dual core clocks 2 GHz and can therefore
> theoretically deliver roughly 2 * 2 * 4 = 16 gflops, it's undeniable that
> this means that Cell starts at a potential of at least 16 * 8 = 128 gflops.
> 
> Having quite a bit more execution units IBM claims 256-320 gflop in fact.
> 
> Personally I believe in progress and that every few years faster CPUs come
> on the market. Not trusting the companies to produce faster processors is
> completely illogical.
> 
> Denying the fact that a Cell processor starts at a factor-of-8 advantage in
> number of gflops, simply because it is clocked nearly 2 times higher and has
> 4 times more cores, is very unsportsmanlike.
> 
> What matters is what happens at the chip: just like network bandwidth
> doesn't stop today's matrix calculations from using the full potential of
> today's chips, memory bandwidth obviously won't stop a Cell processor
> either.

Vincent, I think that what Greg is TRYING to say here in his typical
understated fashion is that (in the words of Inigo Montoya in "The
Princess Bride") "I do not think that this word means what you think it
means" when you use the word "bandwidth".  You assert that memory
latency and bandwidth are not critical bottlenecks in modern system
design and that solving this bottleneck in system design is "easy"
(within the capabilities of a very mediocre engineer, I think were your
words).  

This is simply not the case.  Not even close.  For example, see the real
numbers in:

  http://www.cs.virginia.edu/stream/top20/Balance.html

which presents the ratio of peak MFLOPS to sustained memory bandwidth
(the STREAM machine "balance") on the world's most expensive (and
fastest) hardware.  Note that ALL of these systems are bottlenecked by
bandwidth, in many cases by a factor of 20 or more.  The ones with the
smallest bottlenecks are, um, "expensive", as one might expect.

Now either solving the memory bandwidth bottleneck isn't quite as "easy"
as you make it out to be, or SGI, NEC, Cray, HP, IBM, Sun and Fujitsu
are all hiring incompetent engineers these days, given that these are
all money-is-no-object systems.

This table, interesting as it is, isn't as relevant to COTS designers
as:

  http://www.cs.virginia.edu/stream/standard/Balance.html

If you look here at results for off-the-shelf hardware you will observe
that without exception the CPU is at LEAST 8x faster than the memory
(and the gap is only that small for a single modern system, the
bleeding-edge Opteron 848); more commonly it is 20-50x faster.

So it's really kind of silly to claim that there is no point in being
concerned about a memory bottleneck, since a very simple, very standard
microbenchmark tool, based on instructions that are fairly common
components of various linear transformations, demonstrates not only that
there is a bottleneck but that it slows systems down by a full order of
magnitude relative to what even boring off-the-shelf CPUs >>can<< do in
terms of floating point rates when they are working on data already in
the CPU registers.
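
For the curious, the flavor of kernel such a tool times is nothing
exotic.  Here is a minimal sketch in the spirit of the STREAM "triad"
(this is not McCalpin's actual code; the array size and the crude timer
are just illustrative placeholders):

  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/time.h>

  #define N (8 * 1024 * 1024)   /* 64 MB per array, far bigger than cache */

  static double now(void)
  {
      struct timeval tv;
      gettimeofday(&tv, NULL);
      return tv.tv_sec + tv.tv_usec * 1e-6;
  }

  int main(void)
  {
      double *a = malloc(N * sizeof *a);
      double *b = malloc(N * sizeof *b);
      double *c = malloc(N * sizeof *c);
      double scalar = 3.0;
      long i;

      if (!a || !b || !c) return 1;
      for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

      /* "triad": 2 flops per element, but 24 bytes of memory traffic
         per element (read b[i], read c[i], write a[i]) */
      double t = now();
      for (i = 0; i < N; i++)
          a[i] = b[i] + scalar * c[i];
      t = now() - t;

      printf("triad: %.0f MB/sec, %.0f MFLOPS\n",
             3.0 * N * sizeof(double) / t / 1e6, 2.0 * N / t / 1e6);
      free(a); free(b); free(c);
      return 0;
  }

On a commodity box the MFLOPS this reports is a small fraction of the
CPU's peak, and the ratio of peak to that number is roughly the balance
the tables above tabulate.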

Putting a higher-clocked CPU, a vector CPU, or more pipelines and
fancier/faster floating point instructions into any of these systems can
easily be a waste of time for these classes of code.

Instead of making unfounded statements about advanced electrical
engineering, it might be useful for you to familiarize yourself with
tools that actually measure the VERY REAL bottlenecks that are out there
so you can learn about code organizations and so forth that might help
you work around them a la ATLAS.  That's the real trick -- ATLAS works
because it blocks out a linear algebra computation so that it runs out
of cache as much as possible.  It is a brilliant design, and an example
of the significant benefits to be gained, for some classes of code, by
tuning to the hardware architecture.
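
As a cartoon of the idea (this is nothing like ATLAS's actual generated
code; the block size NB below is a made-up placeholder that would have
to be tuned to the cache in question, which is exactly the sort of thing
ATLAS determines empirically), compare a naive triple loop to a blocked
one:

  #define NB 64    /* block size: a placeholder, tune to the cache */

  /* naive C += A*B: for large n, all of B is re-streamed from main
     memory for every row of C */
  void matmul_naive(int n, const double *A, const double *B, double *C)
  {
      for (int i = 0; i < n; i++)
          for (int j = 0; j < n; j++)
              for (int k = 0; k < n; k++)
                  C[i*n + j] += A[i*n + k] * B[k*n + j];
  }

  /* blocked C += A*B: works on NB x NB tiles so that the tiles of A and
     B being reused stay resident in cache */
  void matmul_blocked(int n, const double *A, const double *B, double *C)
  {
      for (int ii = 0; ii < n; ii += NB)
          for (int kk = 0; kk < n; kk += NB)
              for (int jj = 0; jj < n; jj += NB)
                  for (int i = ii; i < ii + NB && i < n; i++)
                      for (int k = kk; k < kk + NB && k < n; k++) {
                          double aik = A[i*n + k];
                          for (int j = jj; j < jj + NB && j < n; j++)
                              C[i*n + j] += aik * B[k*n + j];
                      }
  }

Both versions do exactly the same O(n^3) arithmetic; the blocked one
just arranges to re-read B from main memory roughly NB times less often,
which is the whole game.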

How much of a CPU's native speed one gets in any given computation
depends upon the organization of that computation and the ratio of how
much computation it does to how much data it requires, just like it does
in a parallelized computation (where the critical bottleneck is more
often how much data it requires that has to come over the network).

If you do enough computation on each data object fetched from memory,
and fetch those objects in an order that permits cache-optimizing things
like prefetch to work to your advantage, you may see no memory
bottleneck at all in your code.  If you are spending very LITTLE time
doing computations on each object and/or are accessing the data in an
order that defeats cache, you might well see a huge bottleneck relative
to the CPU's real (not theoretical) peak rates under ideal
circumstances, e.g. operations on values already in CPU registers.
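
A toy illustration (made-up loops, not from any real code): both
functions below sweep the same vector, but the first does one flop per
eight bytes fetched while the second does over a hundred, so for an
out-of-cache vector the first runs at the speed of the memory subsystem
and the second at the speed of the FPU:

  /* one add per element loaded: bandwidth bound once x spills from cache */
  double vsum(const double *x, long n)
  {
      double s = 0.0;
      for (long i = 0; i < n; i++)
          s += x[i];
      return s;
  }

  /* 2*WORK flops per element loaded (WORK is an arbitrary knob for the
     toy): for large WORK the loop runs at the CPU's arithmetic rate */
  #define WORK 64
  void vchurn(double *x, long n)
  {
      for (long i = 0; i < n; i++) {
          double v = x[i];
          for (int k = 0; k < WORK; k++)
              v = 0.999 * v + 0.001;    /* 2 flops, purely illustrative */
          x[i] = v;
      }
  }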

To fully illustrate the latter point (and show how it depends on
architecture), it is really instructive to run a microbenchmark that
compares random memory access rates to streaming rates.  When I run my
integer
rw test in shuffled and unshuffled/streaming order, for example, I see a
split in rates of a factor of 3-5 even on vectors that fit in L2 cache
on a P4 (with shuffled slower, of course).  On an AMD-64 shuffled vs
streaming is absolutely flat in L2 (kind of impressive, actually) and
there is only a small drop-off as one streams out of main memory.
Random out of main memory, OTOH, drops to the floor (although it remains
above the P4's, which drops THROUGH the floor).  In L2 the performance
is consistently faster than that of the P4 as well for either kind of
access.
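
My rw test is part of a larger benchmark harness, but a stripped-down
sketch of the shuffled-versus-streamed comparison would look something
like the following (the vector size and the shuffle are placeholders,
not my actual test):

  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/time.h>

  static double now(void)
  {
      struct timeval tv;
      gettimeofday(&tv, NULL);
      return tv.tv_sec + tv.tv_usec * 1e-6;
  }

  /* one integer read-modify-write per element, in whatever order idx says */
  static double walk(int *data, const long *idx, long n)
  {
      double t = now();
      for (long i = 0; i < n; i++)
          data[idx[i]] += 1;
      return now() - t;
  }

  int main(void)
  {
      long n = 1L << 24;                 /* 64 MB of ints, well out of L2 */
      int  *data = calloc(n, sizeof *data);
      long *idx  = malloc(n * sizeof *idx);
      if (!data || !idx) return 1;

      for (long i = 0; i < n; i++) idx[i] = i;   /* streaming order */
      double ts = walk(data, idx, n);

      srandom(12345);                            /* then shuffle the order */
      for (long i = n - 1; i > 0; i--) {
          long j = random() % (i + 1);
          long tmp = idx[i]; idx[i] = idx[j]; idx[j] = tmp;
      }
      double tr = walk(data, idx, n);

      printf("streamed: %.1f M rw/sec   shuffled: %.1f M rw/sec\n",
             n / ts / 1e6, n / tr / 1e6);
      free(data); free(idx);
      return 0;
  }

Shrink the vector until it fits in L2 and the same comparison exposes
(or, on the AMD, fails to expose) the in-cache split described above.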

For that reason, I'd expect certain classes of computation to perform
much better on an AMD 64 with suitable blocking than they would on a P4.
Not surprising, but it is useful and entertaining to be able to SEE it
and not theorize about it.

In the meantime, I doubt that you'll convince the many HPC enthusiasts
on this list that there is no memory bandwidth bottleneck, that it is
all a figment of their imagination or the result of openly incompetent
engineering.  Too many of us live with code that is at least partially
slowed by that very thing, and too many consumers would cheerfully
preferentially select systems with a lower bottleneck if it were
possible to do so at constant cost.  In all probability all of us are
affected, given that the problem extends even to L2 (which is
bottlenecked relative to on-chip registers, although remarkably less so
on the new AMDs).

At the same time, we all love faster CPUs and higher clocks, because
cache DOES work for lots of classes of code and many of us (myself
included) DO enough work per memory access that CPU clock, not memory,
is the primary bottleneck.  Just remember that people's code, and with
it the optimal system balance, differs from person to person.  A system
balance in the range of 10-30 is actually just fine for a lot of folks'
code.  Spending 10-100x more for a system to reduce it to 1-3 won't
really make things all that much faster for them, unless peak CPU
increases at the same time, and the cost-benefit has to be carefully
considered.  The result has to be VALUABLE to justify spending that much
extra money just to get a result a bit earlier or faster.

    rgb

> 
> Now we must of course see whether IBM can deliver it cheaply and in time,
> or whether other manufacturers will show up sooner with a similar design
> that's faster and above all cheaper. That still doesn't take away that the
> advantage a chip tens of times faster delivers is obvious when we realize
> that far over 50% of all supercomputer system time goes to matrix-type
> calculations, all of which are embarrassingly parallel and easily
> vectorizable.
> 
> In fact the libraries are already vectorizing it more or less. It's just
> doing more of the same thing faster now.
> 
> Denying that here is very unsportsmanlike towards all companies making
> high-end chips.
> 
> Vincent

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu




