[Beowulf] Re: vectors vs. loops
diep at xs4all.nl
Tue May 3 17:40:17 PDT 2005
At 06:03 PM 5/3/2005 +0200, Philippe Blaise wrote:
>Robert G. Brown wrote:
>>Still, the marketplace speaks for itself. It doesn't argue, and isn't
>>legendary, it just is.
>But, does the hpc marketplace have a direction ?
>Few years ago, some people had a "fantastic vision" to replace the
>vector machines market :
>use big clusters of SMPs with the help of the new paradigm of hybrid
>Then the main vendors (usa), except Cray, were very happy to sell giant
>clusters of smp machines.
>Nevertheless, the japanese guys built the "earth simulator" ; which is
>still the most powerful machine in the world
>(don't trust this stupid top500 list).
>Then Cray came back ... with vector machines...
>Don't underestimate the power of vector machines.
>Yes Fujitsu or NEC vector machines are still very efficient, even with
>non contiguous memory access (!!).
>One year ago, the only cpus that were sometimes able to equal vector
>cpus were alpha (ev7) and itanium2 with
>big caches and / or fast memory access. Remember that alpha is dead.
>Have a look at the itanium2 market share.
>The marketplace is not a good argument at all.
>Vectorization and parallelization are compatible
>Hybrid mpi/openmp programming is a harder task than mpi/vector programming.
>If you have enough money and if your program is vectorizable, buy a
>vector machine of course.
>Clusters of SMPs ? they will remain an efficient and low cost solution,
>(and quite easy for a mass vendor
>to sell).
>And thanks to cluster of SMPs with the help of linux, the HPC market is
>Of course, it would be nice to have a true vector unit on a P4 or Opteron.
>But the problem will be the memory access again.
The P4 does have a lot of weak links in its caches, but there are weaker ones.
If we speak about Opteron, let's assume we want to multiply a few big
integers a bit faster. They are big, but not so big that FFT pays off,
so other methods are the interesting ones.
You want to multiply 64 bits x 64 bits, producing a 128-bit result.
Opteron can do that every other cycle. So it has a 2 cycle latency so to
Far nicer would be 1 multiplication every cycle, or even 2.
So basically such multiplication work can be sped up drastically by
improving the chip rather than the memory bandwidth.
A twofold speed improvement would not be so bad.
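
To make the 64 x 64 -> 128 bit multiplication above concrete, here is a minimal sketch in Python. It builds the wide product from four 32-bit partial products and then uses it as the inner step of a schoolbook multi-limb multiplication, the kind of non-FFT method that suits moderately big integers. The names mul_64x64_to_128 and mul_limbs are made up for this illustration, and the code only models the arithmetic, not how any particular chip implements it.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def mul_64x64_to_128(a, b):
    """Return (high, low) 64-bit halves of the 128-bit product a * b."""
    assert 0 <= a <= MASK64 and 0 <= b <= MASK64
    a_lo, a_hi = a & MASK32, a >> 32
    b_lo, b_hi = b & MASK32, b >> 32

    # Four 32x32 -> 64 bit partial products (schoolbook method).
    ll = a_lo * b_lo
    lh = a_lo * b_hi
    hl = a_hi * b_lo
    hh = a_hi * b_hi

    # Sum the middle column; whatever exceeds 32 bits carries into the high half.
    mid = (ll >> 32) + (lh & MASK32) + (hl & MASK32)
    low = (ll & MASK32) | ((mid & MASK32) << 32)
    high = hh + (lh >> 32) + (hl >> 32) + (mid >> 32)
    return high, low


def mul_limbs(x, y):
    """Schoolbook product of two little-endian lists of 64-bit limbs."""
    out = [0] * (len(x) + len(y))
    for i, xi in enumerate(x):
        carry = 0
        for j, yj in enumerate(y):
            # One wide multiply per pair of limbs: len(x) * len(y) in total.
            high, low = mul_64x64_to_128(xi, yj)
            total = out[i + j] + low + carry
            out[i + j] = total & MASK64
            carry = high + (total >> 64)
        out[i + len(y)] = carry
    return out
```

Two n-limb numbers need n * n wide multiplies and touch only 3 * n limbs of data, so for a few limbs everything sits in registers or L1 and the multiplier throughput sets the pace.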
Oh, about L1 cache: the L1 cache has no problem keeping up. It can do 2
reads from L1 simultaneously.
In fact, all those complaints about memory bandwidth are IMHO a lack of
understanding of how hardware works.
Please have a good look at a benchmark of the dual core Opteron. Yes, it SHARES
the same memory controller between 2 cores now:
So you see diep has a scaling of 3.92 on a dual Opteron, dual core.
That basically means that memory, despite diep having a memory footprint of
400MB, is hardly a problem.
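
As a rough check on that scaling figure: a dual Opteron with dual cores is 4 cores, so a speedup of 3.92 can be turned into a parallel efficiency and, if one assumes a simple Amdahl model, into an effective non-scaling fraction. The sketch below does just that arithmetic. The function names are made up here, and blaming all of the lost scaling on one serial or contended fraction is an assumption of the model, not something the benchmark proves.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear scaling that was actually achieved."""
    return speedup / cores


def amdahl_serial_fraction(speedup, cores):
    """Solve Amdahl's law S = 1 / (f + (1 - f) / N) for the fraction f."""
    return (cores / speedup - 1.0) / (cores - 1.0)


# Figures from the post: scaling of 3.92 on 2 sockets x 2 cores.
efficiency = parallel_efficiency(3.92, 4)
serial = amdahl_serial_fraction(3.92, 4)
print(f"efficiency {efficiency:.1%}, non-scaling fraction {serial:.2%}")
```

Under that model less than 1% of the run fails to scale, shared memory controller included.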
Still not convinced?
Why not multiply a 200MB matrix by a 200MB matrix?
How much memory bandwidth (non-L1/L2 cache bandwidth) does that generate,
when very efficiently programmed, versus how many CALCULATIONS are needed
in the hardware?
If you put the 2 side by side you will soon see the real problem.
You want more multiplications+adds per second, not so much memory bandwidth,
assuming a decent L3 cache size.
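
The comparison can be put into numbers with a small back-of-the-envelope sketch. It assumes things the post does not state: square matrices of 8-byte doubles, 200MB meaning 200 * 10^6 bytes, a classic blocked algorithm whose three tiles fit in cache together, and a purely illustrative 1 MiB cache. The function name matmul_intensity is also made up for this example.

```python
import math


def matmul_intensity(matrix_bytes, elem_bytes=8, cache_bytes=1 << 20):
    """Estimate flops and main-memory traffic for C = A * B (square, dense)."""
    n = math.isqrt(matrix_bytes // elem_bytes)   # matrix dimension
    flops = 2 * n ** 3                           # one multiply + one add per term

    # Lower bound on traffic: read A and B once, write C once.
    compulsory_bytes = 3 * n * n * elem_bytes

    # Blocked estimate: tile size b such that three b x b tiles fit in cache.
    b = math.isqrt(cache_bytes // (3 * elem_bytes))
    # About 2 * n^3 / b elements of A and B are streamed in, plus C written once.
    blocked_bytes = (2 * n ** 3 // b + n * n) * elem_bytes

    return {
        "n": n,
        "flops": flops,
        "compulsory_bytes": compulsory_bytes,
        "tile": b,
        "blocked_bytes": blocked_bytes,
        "flops_per_byte_blocked": flops / blocked_bytes,
    }


if __name__ == "__main__":
    for key, value in matmul_intensity(200_000_000).items():
        print(key, value)
```

With these assumptions n is 5000, so the product costs 2.5 * 10^11 multiply/add operations against 600MB of unavoidable traffic, and even the modest 1 MiB cache leaves roughly 25 operations per byte fetched from main memory. A larger cache raises that ratio further, which is the point about needing multiply+add throughput rather than bandwidth.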