[Beowulf] Is there really a need for Exascale?

Fri Nov 30 06:26:52 PST 2012

On Fri, Nov 30, 2012 at 02:00:03PM +0000, Lux, Jim (337C) wrote:

> Yes. The cache essentially serves as a smart virtual memory. I suppose
> that a question might be what is the optimum granularity of that I cache..
>  And would there be a better architecture for that cache vs the data
> cache, that allowed more/faster access because you KNOW that the dynamics
> of access are different.

Let's say you want to maximize on-die SRAM-type memory for
a given transistor budget, keeping each die small enough that
you get, say, 80-90% total die yield. Then you can go to
wafer-scale integration to reduce costs (dead dies are left
in place and routed around) and to increase on-wafer
mesh/torus throughput (due to much smaller geometries).
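
To make "routed around" a bit more concrete, here is a minimal
sketch, assuming a plain 2D mesh without torus wraparound and a
known set of dead dies. The function name, the grid size and the
dead-die positions are all made up for illustration; it is a
breadth-first search, not a claim about how real wafer-scale
routing hardware works.

```python
from collections import deque

def shortest_hops(width, height, dead, src, dst):
    """Hop count of a shortest path from src to dst in a 2D mesh
    that avoids dead dies, or None if dst cannot be reached."""
    if src in dead or dst in dead:
        return None
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        (x, y), hops = queue.popleft()
        if (x, y) == dst:
            return hops
        # Only nearest-neighbour links, as in a mesh (no wraparound).
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            inside = 0 <= nx < width and 0 <= ny < height
            if inside and nxt not in dead and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None

# Hypothetical 4x3 wafer with two dead dies blocking the direct route.
dead_dies = {(1, 0), (1, 1)}
hops = shortest_hops(4, 3, dead_dies, (0, 0), (3, 0))
print(hops)  # 7 hops instead of the 3 a fully working mesh would need
```

The point of the toy example is only that every dead die costs
extra hops for some traffic, so yield feeds directly into
effective mesh latency.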

You can handle object (method) swapping to external, slower memory
via the OS. As the on-die memory is mapped into the address space
(effectively being a very large register file, or the zero page
of a 65xx), you could also explicitly allocate it from within the
code, at least as a preference, and use garbage collection if
explicit deallocation is a problem.

Note that speed-of-light latency alone across a 300 mm
wafer is 1-2 ns (aka the fabled Grace Hopper 30 cm
nanosecond), so cache coherency would cost you
a lot of time, even across that small an area of silicon
real estate.
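
The 1-2 ns figure is easy to check on the back of an envelope.
A small sketch follows; the 0.5 velocity factor is purely an
assumption to show where the upper end of the range could come
from, and real on-die interconnect is typically slower still.

```python
# Speed of light in vacuum, exact by SI definition.
C_VACUUM_M_PER_S = 299_792_458

def one_way_latency_ns(distance_m, velocity_factor=1.0):
    """Time in ns for a signal to cover distance_m at velocity_factor * c."""
    if not 0.0 < velocity_factor <= 1.0:
        raise ValueError("velocity_factor must be in (0, 1]")
    return distance_m / (velocity_factor * C_VACUUM_M_PER_S) * 1e9

WAFER_DIAMETER_M = 0.300

at_c = one_way_latency_ns(WAFER_DIAMETER_M)            # physical lower bound
at_half_c = one_way_latency_ns(WAFER_DIAMETER_M, 0.5)  # assumed velocity factor

print(f"{at_c:.2f} ns at c, {at_half_c:.2f} ns at 0.5 c")
# prints "1.00 ns at c, 2.00 ns at 0.5 c"
```

That is one way only; any coherency protocol that needs a round
trip across the wafer pays at least twice that.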

Note that ARM or SHARC-like cores are very fat
compared to minimalistic designs like the GA144:
http://www.designspark.com/blog/hands-on-with-a-144-core-processor

The journey goes even further with
http://low-powerdesign.com/sleibson/2011/09/04/the-return-of-magnetic-memory-a-review-of-the-mram-panel-at-the-flash-memory-summit/
if combined with 

http://apl.aip.org/resource/1/applab/v86/i1/p013502_s1?bypassSSO=1

http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=4387374&contentType=Conference+Publications

This is a fully static FPGA, but without a crossbar,
using only locally connected cells that are
fully reconfigurable, including at runtime.

There is no reason why this wouldn't work
in 3D, a la serially deposited multilayer,
or even full-volume 3D integration.

> Or, is the generic approach actually better in the long run, because it's
> less specialized and therefore less dependent on clever compiler and
> coding tricks to optimize performance.