[Beowulf] Re: vectors vs. loops
Robert G. Brown rgb at phy.duke.edu
Wed May 4 06:37:51 PDT 2005
On Wed, 4 May 2005, Vincent Diepeveen wrote:

> At 07:18 PM 5/3/2005 -0700, Greg Lindahl wrote:
> >On Wed, May 04, 2005 at 02:40:17AM +0200, Vincent Diepeveen wrote:
> >
> >> In fact all that complaints about memory bandwidth IMHO is a lack of
> >> understanding how hardware works.
> >
> >Allow me to be the first to congratulate you for being so willing to
> >spend your valuable time to educate us all, and in fact the entire HPC
> >community.
> >-- greg
>
> If an itanium2 montecito dual core clocks 2ghz and can theoretically
> deliver therefore roughly 2 * 4 * 4 = 16 gflops, it's undeniable that this
> means that cell starts at a potential of at least 16 * 8 = 108 gflops.
>
> Having quite a bit more execution units IBM claims 256-320 gflop in fact.
>
> Personally i believe in progress and that each few years faster cpu's come
> on the market. Not trusting the companies to produce faster processors is
> completely illogic.
>
> Denying the fact that a cell processor starts at a factor 8 advantage in
> number of gflops, simply because it is clocked nearly 2 times higher having
> 4 times more cores, is very unsportmanship.
>
> What matters is what happens at the chip, just like network bandwidth
> doesn't stop todays matrix calculations from using the full potential of
> the todays chips, memory bandwidth obviously won't stop a CELL processor
> either.

Vincent, I think that what Greg is TRYING to say here in his typical
understated fashion is that (in the words of Inigo Montoya in "The
Princess Bride") "I do not think that this word means what you think it
means" when you use the word "bandwidth".

You assert that memory latency and bandwidth are not critical
bottlenecks in modern system design and that solving this bottleneck in
system design is "easy" (within the capabilities of a very mediocre
engineer, I think were your words). This is simply not the case. Not
even close.
For example, see the real numbers in:

  http://www.cs.virginia.edu/stream/top20/Balance.html

which presents the ratio between peak MFLOPS and peak memory bandwidth
on the world's most expensive (and fastest) hardware. Note that ALL of
these systems are bottlenecked by bandwidth, in many cases by a factor
of 20 or more. The ones with the smallest bottlenecks are, um,
"expensive" as one might expect. Now either solving the memory
bandwidth bottleneck isn't quite as "easy" as you make it out to be or
SGI, NEC, Cray, HP, IBM, Sun and Fujitsu are all hiring incompetent
engineers these days, given that these are all money-is-no-object
systems.

This table, interesting as it is, isn't as relevant to COTS designers
as:

  http://www.cs.virginia.edu/stream/standard/Balance.html

If you look here at results for off-the-shelf hardware you will observe
that without exception the CPU is at LEAST 8x faster than the memory
(and only that small for a single modern system, the bleeding edge
Opteron 848) and more commonly 20-50x faster.

So it's really kind of silly to claim that there is no point in being
concerned by a memory bottleneck. A very simple, very standard
microbenchmark tool, based on instructions that are fairly common
components of various linear transformations, demonstrates not only
that there is a bottleneck but that the bottleneck slows systems down
by a full order of magnitude relative to what even boring off-the-shelf
CPUs >>can<< do in terms of floating point rates when they are working
on data already in the CPU registers. Putting a higher clock CPU, or a
vector CPU, or more pipelines and fancier/faster floating point
instructions on the CPU on any of these systems can easily be a waste
of time for these classes of code.
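To make the balance figure concrete, here is a minimal sketch. It assumes balance is defined as peak MFLOPS divided by sustained memory bandwidth expressed in 8-byte Mwords/s; the function name `machine_balance` and the numbers fed to it are hypothetical and are not taken from the STREAM tables.

```python
def machine_balance(peak_mflops, bandwidth_mbytes_per_s, word_bytes=8):
    """Peak MFLOPS divided by sustained memory rate in Mwords/s.

    A balance of 1 would mean memory can deliver one operand for every
    floating point operation the CPU could do at peak; larger values
    mean the CPU outruns the memory by that factor.
    """
    mwords_per_s = bandwidth_mbytes_per_s / word_bytes
    return peak_mflops / mwords_per_s


# Illustrative, made-up machine: 4000 MFLOPS peak, 4000 MB/s sustained.
example = machine_balance(4000.0, 4000.0)
print(f"balance = {example:.1f} flops per word moved")
```

On these made-up numbers the CPU could do 8 floating point operations in the time it takes memory to deliver a single 8-byte operand.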
Instead of making unfounded statements about advanced electrical
engineering, it might be useful for you to familiarize yourself with
tools that actually measure the VERY REAL bottlenecks that are out
there so you can learn about code organizations and so forth that might
help you work around them a la ATLAS.

That's the real trick -- ATLAS works because it blocks out a linear
algebra computation so that it runs out of cache as much as possible.
It is a brilliant design, and an example of the significant benefits to
be gained by tuning for a hardware architecture for some classes of
code.

How much of a CPU's native speed one gets in any given computation
depends upon the organization of that computation and the ratio of how
much computation it does to how much data it requires, just like it
does in a parallelized computation (where the critical bottleneck is
more often how much data it requires that has to come over the
network). If you do enough computation on each data object fetched
from memory, and fetch those objects in an order that permits
cache-optimizing things like prefetch to work to your advantage, you
may see no memory bottleneck at all in your code. If you are spending
very LITTLE time doing computations on each object and/or are accessing
the data in an order that defeats cache, you might well see a huge
bottleneck relative to the CPU's real (not theoretical) peak rates
under ideal circumstances, e.g. operations on values already in CPU
registers.

It is really instructive to run a microbenchmark that measures random
memory access rates compared to streaming to fully illustrate the
latter point (and show how it depends on architecture). When I run my
integer rw test in shuffled and unshuffled/streaming order, for
example, I see a split in rates of a factor of 3-5 even on vectors that
fit in L2 cache on a P4 (with shuffled slower, of course).
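A shuffled-versus-streaming comparison of this general kind can be sketched as follows. This is not the integer rw test referred to above, only a rough stand-in: the names `walk` and `compare` are hypothetical, and because Python's interpreter overhead dominates each access, the ratio it reports will be much smaller than what a compiled C loop would show. It reads the same contiguous vector twice, once in memory order and once through a shuffled index list.

```python
import random
import time
from array import array


def walk(data, order):
    """Read data[i] for each i in order; return (sum, seconds elapsed)."""
    start = time.perf_counter()
    total = 0
    for i in order:
        total += data[i]
    return total, time.perf_counter() - start


def compare(n=1_000_000, seed=0):
    """Time a streaming pass and a shuffled pass over the same vector."""
    data = array("q", range(n))            # contiguous 8-byte integers
    streaming = list(range(n))             # indices in memory order
    shuffled = streaming[:]
    random.Random(seed).shuffle(shuffled)  # same indices, random order
    s_sum, s_time = walk(data, streaming)
    r_sum, r_time = walk(data, shuffled)
    assert s_sum == r_sum                  # same work, different order
    return s_time, r_time


if __name__ == "__main__":
    s_time, r_time = compare()
    print(f"streaming {s_time:.3f}s  shuffled {r_time:.3f}s  "
          f"ratio {r_time / s_time:.2f}")
```

Varying `n` so that the vector fits in L1, then L2, then only in main memory is what exposes the architecture dependence; the default of one million 8-byte integers is an arbitrary choice.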
On an AMD-64 shuffled vs streaming is absolutely flat in L2 (kind of
impressive, actually) and there is only a small drop-off as one streams
out of main memory. Random out of main memory, OTOH, drops to the
floor (although it remains above the P4's, which drops THROUGH the
floor). In L2 the performance is consistently faster than that of the
P4 as well for either kind of access. For that reason, I'd expect
certain classes of computation to perform much better on an AMD 64 with
suitable blocking than they would on a P4. Not surprising, but it is
useful and entertaining to be able to SEE it and not theorize about it.

In the meantime, I doubt that you'll convince the many HPC enthusiasts
on this list that there is no memory bandwidth bottleneck and it is all
a figment of their imagination or the result of openly incompetent
engineering. Too many of us live with code that is at least partially
slowed by that very thing; too many consumers would cheerfully
differentially select systems with a lower bottleneck if it were
possible to do so at constant cost. All of us, in fact, are affected
in all probability, given that the problem extends even to L2
(bottlenecked relative to on-chip registers, although remarkably less
on the new AMDs).

In the meantime, we all love faster CPUs and higher clocks because
cache DOES work for lots of classes of code and many of us (myself
included) DO enough work per memory access that CPU clock is the
primary bottleneck and not memory. Just remember that people's codes,
and the system balance that is optimal for each, tend to differ. A
system balance in the range of 10-30 is actually just fine for a lot of
folks' code. Spending 10-100x more for a system to reduce it to 1-3
won't really make things all that much faster for them, unless peak CPU
increases at the same time, and the cost-benefit has to be carefully
considered. The result has to be VALUABLE to justify spending that
much extra money just to get a result a bit earlier or faster.
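The cost-benefit point can be put in numbers with a simple roofline-style bound, which is a swapped-in textbook model and not anything measured here: a code cannot run faster than the CPU's peak, nor faster than memory can feed it operands. The function names, the 4000 MFLOPS peak, and the two work-per-word values below are all hypothetical.

```python
def attainable_mflops(peak_mflops, bandwidth_mwords_per_s, flops_per_word):
    """Upper bound on sustained rate: CPU peak or memory feed, whichever is lower."""
    memory_bound = bandwidth_mwords_per_s * flops_per_word
    return min(peak_mflops, memory_bound)


def speedup_from_better_balance(peak_mflops, old_balance, new_balance,
                                flops_per_word):
    """Gain from lowering balance at fixed peak, for a code doing
    flops_per_word floating point operations per word fetched."""
    old = attainable_mflops(peak_mflops, peak_mflops / old_balance,
                            flops_per_word)
    new = attainable_mflops(peak_mflops, peak_mflops / new_balance,
                            flops_per_word)
    return new / old


# Hypothetical 4000 MFLOPS CPU, balance improved from 20 to 2.
heavy = speedup_from_better_balance(4000.0, 20.0, 2.0, flops_per_word=30.0)
light = speedup_from_better_balance(4000.0, 20.0, 2.0, flops_per_word=1.0)
print(f"compute-heavy code: {heavy:.1f}x, streaming-like code: {light:.1f}x")
```

Under these assumptions the code that does 30 operations per word gains nothing from the far more expensive memory system, while the code that does one operation per word gains the full factor of 10.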
   rgb

>
> Now we must see of course whether IBM can deliver it cheaply and in time,
> or whether other manufacturers will show up sooner with a similar design
> that's faster and above all cheaper. That still doesn't take away the
> advantage a chip tens of times faster delivers is obvious when we realize
> that far over 50% of all supercomputer system time goes to matrix type
> calculations. All of which are embarrassingly parallel and easy vectorizable.
>
> In fact the libraries are already vectorizing it more or less. It's just
> doing more of the same thing faster now.
>
> Denying that here is very unsportmanship towards all companies making
> highend chips.
>
> Vincent
>
> >_______________________________________________
> >Beowulf mailing list, Beowulf at beowulf.org
> >To change your subscription (digest mode or unsubscribe) visit
> >http://www.beowulf.org/mailman/listinfo/beowulf
>

-- 
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
