[Beowulf] Re: vectors vs. loops

Tue May 3 18:18:47 PDT 2005

On Wed, 4 May 2005, Vincent Diepeveen wrote:                                    

> This isn't at all the problem. So to speak for highend processors even a      
> medium engineer can add a few on die memory controllers to solve the          
> bandwidth problem. 1 for every few cores.  

I'm not a hardware engineer, but it seems to me that pin count becomes an
issue.  With N cores and M memory controllers, there would need to be at
least M*bus_width pins on the socket and M independent banks of RAM
modules.  That doesn't seem viable: a chip package with such a high pin
count, and motherboards with that many traces, would not be economical to
produce.  The motivation for multicore chips is precisely to avoid that
packaging and motherboard complexity.  Doing it another way (say, with
memory controllers sharing a bus to memory) would not provide scalable
memory performance, because of bus contention.
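
To put rough numbers on the pin count, here is a minimal sketch.  It
counts data pins only, and the 128-bit width is just an assumed figure for
illustration; address, control, power, and ground pins would all come on
top of this.

```python
def data_pins(num_controllers: int, bus_width_bits: int) -> int:
    """Data pins only: one full-width bus per on-die memory controller."""
    return num_controllers * bus_width_bits


# Assumed 128-bit channels; M = 1, 2, 4, 8 controllers.
for m in (1, 2, 4, 8):
    print(f"M={m}: {data_pins(m, 128)} data pins")
```

Even at M=4 that is already 512 data pins before anything else is counted.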

The Cell processor seems to do away with cache for its vector units and 
instead gives each one a flat scratch space of local memory.  Data has 
to be moved to and from main memory explicitly by DMA.  This avoids all 
of the cache coherency logic, and it does provide scalable memory 
performance, at least up to the limits of that local memory space.
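
The programming model this implies can be sketched as follows.  This is
only a simulation in plain Python, not Cell code: dma_get, dma_put,
scale_in_place, and the LOCAL_STORE_WORDS size are all hypothetical names
and values made up for illustration, and the "DMA" is just a list copy.

```python
LOCAL_STORE_WORDS = 1024  # hypothetical scratch size, not a real Cell figure


def dma_get(main, src, local, dst, n):
    """Explicit fetch: copy n words from main memory into the local store."""
    if src + n > len(main) or dst + n > len(local):
        raise ValueError("transfer out of bounds")
    local[dst:dst + n] = main[src:src + n]


def dma_put(local, src, main, dst, n):
    """Explicit store: copy n words from the local store back to main memory."""
    if src + n > len(local) or dst + n > len(main):
        raise ValueError("transfer out of bounds")
    main[dst:dst + n] = local[src:src + n]


def scale_in_place(main, factor, chunk=LOCAL_STORE_WORDS):
    """Scale every element of main, touching main memory only through DMA."""
    local = [0.0] * chunk  # the flat scratch space; no cache, no coherency
    for base in range(0, len(main), chunk):
        n = min(chunk, len(main) - base)
        dma_get(main, base, local, 0, n)   # stage a block in
        for i in range(n):                 # compute on local data only
            local[i] *= factor
        dma_put(local, 0, main, base, n)   # write the block back


data = [float(i) for i in range(10)]
scale_in_place(data, 2.0, chunk=4)
print(data)
```

The point is that the compute loop never touches main memory directly, so
the data set has to be blocked to fit the scratch space by the programmer
(or the compiler) rather than by cache hardware.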

It is interesting to note that these CPU/memory units in the Cell look
more and more like a "computer-on-a-chip," and the Cell as a whole is
basically a "cluster-on-a-chip."  Communication to main memory and to
other vector units needs to be explicitly scheduled (a la MPI).  Each core
is optimally balanced in memory bandwidth and computational performance.  
The data that it works on is local *by definition* and hence fast.  And
the messy logic associated with cache maintenance gets flushed away, but
at the expense of requiring explicit fetching/message passing.

This may well be the future: small cores with ever-growing "scratch"  
space that use explicit fetches/stores to main memory.  Maybe some or most
of this will change from SRAM to DRAM to make these local memories larger.  
Whatever the case, main storage needs to find its way closer to the
processing core.  Double/quad-pumped buses and 128-bit memory channels can
only take us so far.
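
For a sense of scale, peak bandwidth is just bus width times clock times
transfers per clock.  A minimal sketch, where the 200 MHz clock is an
assumed example value and not a figure for any particular system:

```python
def peak_bandwidth_bytes(bus_width_bits: int, clock_hz: int,
                         transfers_per_clock: int) -> int:
    """Theoretical peak in bytes/second; real sustained rates are lower."""
    return (bus_width_bits // 8) * clock_hz * transfers_per_clock


# Assumed: 128-bit channel at 200 MHz, double- and quad-pumped.
for pumps in (2, 4):
    gb = peak_bandwidth_bytes(128, 200_000_000, pumps) / 1e9
    print(f"{pumps} transfers/clock: {gb:.1f} GB/s peak")
```

Each doubling costs either more pins or a faster off-chip bus, and that
peak is shared by every core on the socket.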

Cell is a test architecture.  It will be interesting to see if the
compilers can uncover the parallelism and make good use of this approach
or if the technology will fade into the background like VLIW/EPIC mostly
has.

Mike Prinkey