[Beowulf] core diameter is not really a limit

Eugen Leitl eugen at leitl.org
Mon Jun 17 08:12:57 PDT 2013


On Mon, Jun 17, 2013 at 10:37:02AM -0400, Mark Hahn wrote:

> > I think there is a large class of problems where direct
> > long-distance communication is not necessary. E.g. if
> 
> sure, but there are also interconnects which do not follow 
> the premises of this article (mesh/lattice/torus ones).

This is precisely the topology I was talking about
(a torus doesn't really work for nanoscale components, so
you'll be limited to the native connectivity of the crystallographic
grid of the individual computational cells).
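
To make that concrete, a minimal sketch (my illustration, nothing
from the thread) of computing with only the native connectivity of a
cubic grid: every cell update may touch nothing but its six face
neighbors, here as a toy relaxation rule in numpy:

import numpy as np

def neighbor_step(grid):
    # One update using only face-neighbor exchange (periodic edges).
    neighbor_sum = sum(np.roll(grid, shift, axis)
                       for axis in range(3)
                       for shift in (-1, +1))
    # Toy local rule: relax each cell toward the mean of its 6 neighbors.
    return 0.5 * grid + 0.5 * neighbor_sum / 6.0

grid = np.random.rand(32, 32, 32)
for _ in range(10):
    grid = neighbor_step(grid)

Any longer-range communication has to be composed out of many such hops.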
 
> I don't really have a sense for how large a class you're talking about,

In principle, the methods scale to self-gravitation-bounded
assemblies, where relativistic latency prohibits any
global synchronization (hey, why not dream big?).

> though.  from a general-purpose/flexibility standpoint, a purely 
> local-interacting system, even if 3d, would seem to have issues.
> for instance, if you had discrete fileservers, how would they connect

If you have a free-running system (say, a 1:1-scale augmented
reality system for cities or regions), then it fundamentally 
serves local users until hardware fails or there's a system
update, and it can be headless, i.e. without central
servers (a P2P model, just as you'd PXE-boot a
cluster from neighbor nodes). Ditto for very long-running jobs.

It's an interesting problem how you do periodic snapshots
if your relativistic ping-pong is on the second scale.
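
Back of the envelope (my numbers): a one-second round trip already
implies an assembly some 150,000 km across, so snapshots would have to
be taken locally and reconciled into a consistent cut after the fact,
rather than behind a global barrier:

c = 299_792_458.0      # speed of light, m/s
round_trip = 1.0       # "relativistic ping-pong on the second scale", s
one_way_extent = c * round_trip / 2
print(f"{one_way_extent / 1e3:,.0f} km")   # ~150,000 km end to end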

> to a giant cube that is partitioned into job-subcubes?  obviously 
> scheduling jobs so that they're in contiguous sub-volumes is much
> more constrained than on a better-connected cluster, as well.

It looks like a one-job cluster. Swapping state in and out
would create too much overhead at that scale.
 
> I saw a conference talk by David Turek (IBM exascale) recently, wherein he
> was advocating coming up with a new model that gives up the kind of extreme
> modularity that has been traditional in computing.  it's a bit of a strawman,
> but the idea is that you have specialized cpus that mostly compute 
> and do not have much storage/state, talking to various permutations
> of random-access storage (cache, shared cache, local and remote dram),
> all talking over a dumb-pipe network to storage, which does no computation,
> just puts and gets.  (this was at HPCS, which was a slightly odd mashup
> of bigdata (ie, hadoop-like) and HPC (mostly simulation-type crunching)).
> 
> all this in the context of the sort of picojoule-pinching
> that exascale theorists worry about.  it wasn't really clear 
> what concrete changes he was advocating, but since higher capacity
> clearly causes systems of greater extent, he advocates spreading 
> the computation everywhere.  compute in the storage, compute in the network, 
> presumably compute in memory.  the pJ approach argues that computations
> involving some state should happen as close to that state as possible.

Yes, this is precisely why you need to converge on a CA (crystalline computation)
model at the lunatic fringe. There your data, your computer, and your
network are all the same thing. You don't even need globally addressable
memory here, provided you can query clusters of state inside the volume
by encoding them into glider configurations. It's way too cramped at
the nanometer scale to do computations there directly, so you have to introduce
a virtual circuit layer.
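
As a toy illustration of that (mine, not a design): in a Life-style CA
the data, the computer and the network are literally the same lattice,
and a glider is state that propagates through the volume using nothing
but the local rule:

import numpy as np

def life_step(grid):
    # One Game of Life step on a toroidal grid, nearest neighbors only.
    nbrs = sum(np.roll(np.roll(grid, dy, 0), dx, 1)
               for dy in (-1, 0, 1) for dx in (-1, 0, 1)
               if (dy, dx) != (0, 0))
    return ((nbrs == 3) | ((grid == 1) & (nbrs == 2))).astype(np.uint8)

grid = np.zeros((16, 16), dtype=np.uint8)
for y, x in [(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)]:   # a single glider
    grid[y, x] = 1
for _ in range(4):   # after four steps the glider has shifted by one cell
    grid = life_step(grid)

The virtual circuit layer is then just an agreed-upon encoding of logic
on top of such propagating patterns.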
 
> I'm skeptical that just because flops are cheap, we need to put them 

If your unit of computation is a cell of ~100-gate complexity, flops are not exactly
cheap.
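
Rough arithmetic, with numbers that are only my assumptions: if a
double-precision FMA unit takes something on the order of 10^5 gates,
then at ~100 gates per cell you're looking at roughly a thousand cells
per flop unit, before paying the virtual-circuit overhead on top:

GATES_PER_CELL = 100        # the ~100-gate cell above
GATES_PER_DP_FMA = 1e5      # my rough assumption, order of magnitude only
cells_per_fma = GATES_PER_DP_FMA / GATES_PER_CELL
print(cells_per_fma)        # ~1,000 cells just for the raw logic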

> everywhere.  OTOH, the idea of putting processors into memory has always
> made a lot of sense to me, though it certainly changes the programming 
> model.  (even in OO, functional models, there is a "self" in the program...)

Memory has been creeping into the CPU for some time. Parallella, e.g.,
has memory embedded in the DSP cores on-die. Hybrid Memory Cube is
about putting memory on top of your CPU. Obviously, the next step
is mixing memory and CPU, even though that is problematic with
current fabrication processes. The step after that is something like
a cellular FPGA, and then you're within touching distance of the 
Fredkin/Toffoli paradigm. Not quite as weird to program as QC,
though of course if you have a 3D lattice of spins
there might even be some entanglement, assuming the latter buys you
something -- I'm very far from being convinced there.
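
For reference, the Fredkin/Toffoli paradigm just means reversible
logic: both gates are bijections on their inputs, so nothing is erased
per operation. A quick sketch (my example):

def toffoli(a, b, c):
    # CCNOT: flip the target c iff both controls a and b are 1.
    return a, b, c ^ (a & b)

def fredkin(c, x, y):
    # Controlled swap: exchange x and y iff the control c is 1.
    return (c, y, x) if c else (c, x, y)

# Each gate is its own inverse, so applying it twice is the identity.
for bits in [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]:
    assert toffoli(*toffoli(*bits)) == bits
    assert fredkin(*fredkin(*bits)) == bits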


