[Beowulf] Re: vectors vs. loops

Wed May 4 10:24:30 PDT 2005

On Wed, 4 May 2005, Vincent Diepeveen wrote:

> >This is simply not the case.  Not even close.  For example, see the real
> >numbers in:
> >
> >  http://www.cs.virginia.edu/stream/top20/Balance.html
> >
> >which presents the ratio between peak MFLOPS and peak memory bandwidth
> >on the world's most expensive (and fastest) hardware.  Note that ALL of
> >these systems are bottlenecked by bandwidth, in many cases by a factor
> >of 20 or more.  The ones with the smallest bottlenecks are, um,
> >"expensive" as one might expect.
> 
> Those bottlenecks are derived because they use 1000s of chickens.

There are a variety of architectures represented, but these systems are
not COTS clusters which is what I'd take the chickens metaphor to
represent.  The basic point is that it is not cheap to match floating
point capacity to memory bandwidth for streaming data.

....

> So let's adress now memory problem some say cell processor has. Of course
> i'll give the word to an engineer there. Just read his 2 reviews :
> part one:
>   http://www.realworldtech.com/page.cfm?ArticleID=RWT021005084318
> 
> part 2, especially about the memory connection to the chip:
>   http://www.realworldtech.com/page.cfm?ArticleID=RWT022805234129

The reviews aren't important; the technical description is and is very
interesting.  The claim is that the dual rambus interface will provide
25 GB/sec bandwidth to a "processor" with 8 SIMD floating point units
capable of producing about 25 GFLOPs peak (forget the "250 GFLOPS" --
only IEEE compliant double precision arithmetic counts in HPC).  This is
an imbalance of roughly 8:1 for double precision data movement, although
that depends somewhat on the answer to the eternal question "what's a
FLOPs" without referring it to a specific benchmark and context.  It
could easily be twice that, or more.  In other words, it might deliver
a real-world throughput of 3-4 GFLOPS acting on real streaming data.
Impressive and lovely, to be sure, but not quite as impressive as "250
GFLOPS" and a lot more in line with what one can expect based on
existing memory bandwidth bottlenecks.

It will be interesting to see what kind of stream numbers are produced
and what various microbenchmarks yield when real systems are out there
to play with.

> You shift the bandwidth problem of the expensive network in that case to
> the processor itself.
> 
> I find that a very good idea. I'm no expert on making mainboards, but from
> an expert i understood that a major problem in communication speeds is that
> above 500Mhz the cupper wires start working as a kind of a 'transistor
> radio antenna'. 
> 
> So again big reason to put things inside a chip, rather than just focus
> upon making 500 mflop processors by the zillion and ordering a network for
> 1048576 processors.
> 
> Everything that gets done inside the chips caches means the network doesn't
> get stressed.

There are oh so many problems, and you're just scraping the surface.
Most of the problems have "obvious" solutions until you realize that the
engineering is all about tradeoffs, and that the overriding constraints
are heat production and the number of transistors one can cram into a
single chip.  The really interesting thing is that all the CPU
manufacturers tend to work on a fairly level playing field here --
within a factor of two or three.  There is definitely room for
innovation and good ideas here, and the cell sponsors are definitely
capable of some good engineering, but EVERYBODY robs some peter to pay
this or that paul, and EVERYBODY is dealing with the cruel fact that
memory bandwidth to DRAM is far smaller than the peak capabilities of
what they put onto the chips already.  So one is pretty much gambling on
the kind of work one's customers will want to get done efficiently
(often at the expense of work that is done more rarely but is still
important).

Intel and AMD both have their roadmaps as well, and both of them are
expecting to ramp up CPU capabilities and are doing various things that
they hope will allow the memory bandwidth problem to be "managed" (avoid
bottlenecking the "important" code, which may or may not be SIMD
floating point stuff).  Competition is simply lovely for the consumer.
It is also interesting to observe the relative weight Intel, AMD and the
Cell design place on DP floating point.

> Ideal is of course 1 chip that is doing entire calculation for you.

Sure, but one then builds a cluster out of THAT chip.  The tradeoffs
between serial speed of a single CPU and performance of a "cluster"
(including an SMP machine) and Amdahl's law and generalizations are all
well known.  

> If you simply study the hitrates the L1 and L2 cache get at the average
> software, then it's trivial that the streaming from main memory is very
> slow, but it just happens in a very tiny fraction of the cases that you
> need that main memory.
> 
> Those caches in todays processors really work very well!

...for some code.  For some code.  For some code.  For MY code, for that
matter.  But not for all code.  That's what the "balance" is all about.
Different kinds of code can tolerate different ratios between peak CPU
speed and memory speed without bottlenecking.

> >Putting a higher clock CPU, or a vector CPU, or more pipelines and
> >fancier/faster floating point instructions on the CPU on any of these
> >systems can easily be a waste of time for these classes of code.
> 
> There are very capable programmers, show them 1 thing that is real fast at
> the processor and they will rewrite the applications such that they make it
> perform fast using that trick.

Sigh.  Look, I AGREE that fast CPUs are lovely and having a high CPU
peak speed at this or that can be very beneficial and that code will be
written that uses the speed.  Or to put it another way, if you do enough
computation of ANY sort per data retrieval you can do that computation
at a rate close to the CPUs internal speed.  What you don't seem to be
willing to acknowledge is that there are pieces of work that people
perform that CANNOT be sped up by using "tricks" involving the CPU's
theoretical peak speeds -- the data just isn't going to be there to
operate on any faster than main memory can deliver it.

You also seem to be ignoring my remarks that certain memory access
patterns will make the world's fastest CPUs perform like pigs.
Streaming access patterns are really not the "best" ones to illustrate
the CPU-memory bottleneck because they are the ones people HAVE worked
hard to minimize with prefetch, parallel DMA transfers, sophisticated
memory controllers and branch predictors, and all sorts of other crap.
Feed an algorithm data from random chunks of main memory and all that
fancy stuff sits twiddling its metaphorical thumbs waiting for mail
memory to deliver at a full latency hit.

> >Instead of making unfounded statements about advanced electrical
> >engineering, it might be useful for you to familiarize yourself with
> >tools that actually measure the VERY REAL bottlenecks that are out there
> >so you can learn about code organizations and so forth that might help
> >you work around them a la ATLAS.  That's the real trick -- ATLAS works
> >because it blocks out a linear algebra computation so that it runs out
> >of cache as much as possible.  It is a brilliant design, and an example
> >of the significant benefits to be gained by tuning for a hardware
> >architecture for some classes of code.
> 
> Game tree search is among the most optimized software on the planet, my own
> code is no exception there.

I have no idea what "the most optimized software on the planet means",
given that optimized is optimized.

> In fact Diep has the best scaling of any chessprogram at multiprocessor
> hardware and the best speedup. 
> 
> So who must teach who how to program?

I'm not trying to teach you to program.  I don't know what I'm trying to
do anymore -- obviously nothing constructive. I suppose I was trying to
get you to temper remarks that implied that all the world's computer
systems engineers and chipset designers were idiots because you had had
the word -- that it was "easy" to solve the memory bandwidth bottleneck
problem, or that there was no such problem.  I personally don't believe
that those engineers are idiots and can see the problem for myself with
microbenchmarking tools on at least all the systems I have access to.
Rather the opposite.

So instead I'll just retire from the discussion, since there is clearly
no useful purpose to it.

    rgb

> 
> Vincent
> 
> >How much of a CPU's native speed one gets in any given computation
> >depends upon the organization of that computation and the ratio of how
> >much computation it does to how much data it requires, just like it does
> >in a parallelized computation (where the critical bottleneck is more
> >often how much data it requires that has to come over the network).
> >
> >If you do enough computation on each data object fetched from memory,
> >and fetch those objects in an order that permits cache-optimizing things
> >like prefetch to work to your advantage, you may see no memory
> >bottleneck at all in your code.  If you are spending very LITTLE time
> >doing computations on each object and/or are accessing the data in an
> >order that defeats cache, you might well see a huge bottleneck relative
> >to the CPU's real (not theoretical) peak rates under ideal
> >circumstances, e.g. operations on values already in CPU registers.
> >
> >It is really instructive to run a microbenchmark that measures random
> >memory access rates compared to streaming to fully illustrate the latter
> >point (and show how it depends on architecture).  When I run my integer
> >rw test in shuffled and unshuffled/streaming order, for example, I see a
> >split in rates of a factor of 3-5 even on vectors that fit in L2 cache
> >on a P4 (with shuffled slower, of course).  On an AMD-64 shuffled vs
> >streaming is absolutely flat in L2 (kind of impressive, actually) and
> >there is only a small drop-off as one streams out of main memory.
> >Random out of main memory, OTOH, drops to the floor (although it remains
> >above the P4's, which drops THROUGH the floor).  In L2 the performance
> >is consistently faster than that of the P4 as well for either kind of
> >access.
> >
> >For that reason, I'd expect certain classes of computation to perform
> >much better on an AMD 64 with suitable blocking than they would on a P4.
> >Not surprising, but it is useful and entertaining to be able to SEE it
> >and not theorize about it.
> >
> >In the meantime, I doubt that you'll convince the many HPC enthusiasts
> >on this list that there is no memory bandwidth bottleneck and it is all
> >a figment of their imagination or the result of openly incompetent
> >engineering.  Too many of us live with code that is at least partially
> >slowed by that very thing, too many consumers would cheerfully
> >differentially select systems with a lower bottleneck if it were
> >possible to do so at constant cost.  All of us, in fact, are affected in
> >all probability, given that the problem extends even to L2 (bottlenecked
> >relative to on-chip registers, although remarkably less on the new
> >AMDs).
> >
> >In the meantime, we all love faster CPUs and higher clocks because cache
> >DOES work for lots of classes of code and many of us (myself included)
> >DO enough work per memory access that CPU clock is the primary
> >bottleneck and not memory.  Just remember that people's code and the
> >corresponding optimal system balance tends to be different.  A system
> >balance in the range of 10-30 is actually just fine for a lot of folks'
> >code.  Spending 10-100x more for a system to reduce it to 1-3 won't
> >really make things all that much faster for them, unless peak CPU
> >increases at the same time, and the cost-benefit has to be carefully
> >considered.  The result has to be VALUABLE to justify spending that much
> >extra money just to get a result a bit earlier or faster.
> >
> >    rgb
> >
> >> 
> >> Now we must see of course whether IBM can deliver it cheaply and in time,
> >> or whether other manufacturers will show up sooner with a similar design
> >> that's faster and above all cheaper. That still doesn't take away the
> >> advantage a chip tens of times faster delivers is obvious when we realize
> >> that far over 50% of all supercomputer system time goes to matrix type
> >> calculations. All of which are embarrassingly parallel and easy
> vectorizable.
> >> 
> >> In fact the libraries are already vectorizing it more or less. It's just
> >> doing more of the same thing faster now.
> >> 
> >> Denying that here is very unsportmanship towards all companies making
> >> highend chips.
> >> 
> >> Vincent
> >> 
> >> >
> >> >_______________________________________________
> >> >Beowulf mailing list, Beowulf at beowulf.org
> >> >To change your subscription (digest mode or unsubscribe) visit
> >> http://www.beowulf.org/mailman/listinfo/beowulf
> >> >
> >> >
> >> _______________________________________________
> >> Beowulf mailing list, Beowulf at beowulf.org
> >> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
> >> 
> >
> >-- 
> >Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
> >Duke University Dept. of Physics, Box 90305
> >Durham, N.C. 27708-0305
> >Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
> >
> >
> >
> >
> 

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu