[Beowulf] ARM cpu's and development boards and research

Eugen Leitl eugen at leitl.org
Wed Nov 28 01:24:34 PST 2012


On Wed, Nov 28, 2012 at 02:17:37AM -0500, Mark Hahn wrote:

> "small pieces tightly connected", maybe.  these machines offer very nice
> power-performance for those applications that can scale efficiently to 
> say, tens of thousands of cores.  (one rack of BGQ is 32k cores.)

Consider a silicon compiler that compiles your problem to an
FPGA, only in a 3d volume, and where each cell only needs some
10-100 transistor equivalents (measured in current solid state
structures). Such approaches could ultimately give you
installations with a mole (Avogadro's number) of bits at a
single site, if you're (self-)assembling molecular-scale
components (not yet, but wait 40-50 years).
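
Back-of-envelope, just to put a number on that (the per-cell
transistor budget is the assumption from above, the rest is
arithmetic):

# Rough scale check for the "mole of bits" figure.
N_A = 6.022e23                     # Avogadro's number = bits per "mole of bits"
transistors_per_cell = 100         # assumed upper end from above
total_devices = N_A * transistors_per_cell      # ~6e25 device equivalents

# For comparison, a current large die carries on the order of 1e10 transistors.
print(f"~{total_devices / 1e10:.0e} of today's large dies worth of devices")
# ~6e+15 dies: nothing you reach by tiling 2d wafers, hence the appeal
# of molecular-scale self-assembly in 3d.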
 
> we sometimes talk about "embarrassingly parallel" - meaning a workload
> with significant per-core computation requiring almost no communication. 
> but if you have an app that scales to 50k cores, you must have a very, 
> very small serial portion (Amdahl's law wise).  obviously, 

There are problems which have no serial portion at all.
In fact, most physical simulations can be formulated that
way, at least in principle.
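
To put numbers on the 50k-core case, a trivial Amdahl sketch:

# Amdahl's law: speedup on n cores given a serial fraction s.
def amdahl_speedup(s, n):
    return 1.0 / (s + (1.0 - s) / n)

n = 50000
for s in (0.01, 0.001, 0.0001, 0.0):
    print(f"serial fraction {s:>7}: speedup {amdahl_speedup(s, n):10.1f} on {n} cores")
# A 1% serial part caps you near 100x; to actually use 50k cores the
# serial fraction has to be essentially zero, as in stencil-style
# physics codes where every cell updates independently each step.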

> they put that 5d torus in a BGQ for a reason, 
> not just to permit fast launch of EP jobs.
> 
> I don't think either Gb or IB are a good match for the many/little
> approach being discussed.  SiCortex was pretty focused on providing 
> an appropriate network, though the buying public didn't seem to 
> appreciate the nuance.

I think it's a question of price. Beowulf is traditionally
about COTS, but SoCs are off the shelf these days, and
small companies or individuals will bridge the small gap to a custom
board and power supply for little money, if you're buying
10 k or 100 k units. SeaMicro might be just great, but I have a
feeling they won't be competing in the same market segment
as Supermicro.

It will be a while until you can buy a kilonode or so ARM
SoC box from Supermicro for, say, 3-6 kUSD.
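
The arithmetic behind that price point (the box prices are just my
guess):

# Hypothetical kilonode ARM SoC box at the price point above.
nodes = 1024
for box_price in (3000, 6000):            # USD, assumed
    print(f"{box_price} USD box -> {box_price / nodes:.2f} USD/node "
          "(SoC + board share + PSU share + chassis)")
# 3-6 USD per node all-in is below the BOM of most SoC modules today,
# which is why this is still some years out.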
 
> IB doesn't seem like a great match for many/little: a lot of cores 
> will have to share an interface to amortize the cost.  do you provide 
> a separate intra-node fabric, or rely on cache-coherence within a node?

Cache coherence doesn't scale. Look at how Adapteva does it:
they have embedded memory (32 kByte today, planned up to 1 MByte)
in each SHARC-like core, sitting on an on-die mesh. There is no
reason why that mesh wouldn't scale. FWIW, Adapteva
is only one order of magnitude away from exascale, if your
problem fits the few-kBytes/core model without needing to
touch main node RAM. That RAM could be stacked in principle,
which would give you 100-200 GByte/s of throughput, i.e. the
GPGPU ballpark for memory bandwidth.
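
Back-of-envelope on the "one order of magnitude" claim, using
ballpark E64G401 figures from memory (64 cores, ~100 GFLOPS single
precision peak, ~2 W; treat those as my assumptions):

# How far the few-kBytes/core, on-die-mesh model is from exascale, roughly.
chip_gflops = 100.0        # assumed Epiphany-IV-class SP peak
chip_watts  = 2.0          # assumed chip power
cores, local_kb = 64, 32   # cores per chip, embedded SRAM per core

chips = 1e18 / (chip_gflops * 1e9)
print(f"chips for 1 EFLOP/s peak : {chips:.1e}")
print(f"power at that efficiency : {chips * chip_watts / 1e6:.0f} MW")
print(f"total on-chip SRAM       : {chips * cores * local_kb / 1e9:.0f} TB")
# ~1e7 chips, ~20 MW, ~20 TB of scratchpad, peak numbers only; discount
# for real codes, the mesh and host RAM traffic and you land roughly an
# order of magnitude short.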

> Gb is obviously a lot cheaper, but at least as normally operated is 
> a non-starter latency-wise.  (and it's important to realize that latency
> becomes even more important as you scale up the node count, giving each
> less work to do...)

If you can route around dead dies (whether at the fabrication stage
or during operation), then you can put your mesh on a 300 mm wafer.
Even with local links only you'd have 4-6 links per die, and 4-6
links are enough to map a 3d volume onto flatland (e.g. via a
diamond lattice).
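
A toy illustration of routing around dead dies (nothing Epiphany- or
wafer-specific, just the idea): mark some dies dead and let a
breadth-first search detour around them.

from collections import deque

W, H = 8, 8                              # 8x8 dies on a (very small) wafer
dead = {(2, 3), (2, 4), (5, 1)}          # dies that failed at test time

def neighbours(x, y):
    # 4 local links per die; a diamond-lattice embedding gives up to 6
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < W and 0 <= ny < H and (nx, ny) not in dead:
            yield nx, ny

def route(src, dst):
    # plain BFS: shortest path through live dies only
    prev, q = {src: None}, deque([src])
    while q:
        cur = q.popleft()
        if cur == dst:
            path = []
            while cur is not None:
                path.append(cur)
                cur = prev[cur]
            return path[::-1]
        for nxt in neighbours(*cur):
            if nxt not in prev:
                prev[nxt] = cur
                q.append(nxt)
    return None                          # destination isolated by dead dies

print(route((0, 3), (7, 3)))             # detours around the dead dies at x=2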
 
> > Essentially, you want the processors to be
> > *just* fast enough to keep ahead of the networking and memory, but no
> > faster to optimize energy savings.
> 
> interconnect is the sticking point.
> 
> I strongly suspect that memory is going to become a non-issue.  shock!
> from where I sit, memory-per-core has been fairly stable for years now
> (for convenience, let's say 1GB/core), and I really think dram is going
> to get stacked or package-integrated very soon.  suppose your building

Yes, including real TSV stacks, which need cool nodes if they're
not to fry. I also expect MRAM to show up there, since it behaves
much like SRAM, needs no refresh and doesn't leak.

> block is 4 fast cores, 256 "SIMT" gpu-like cores, and 4GB very wide dram?

If you want die yield to be >80% you have to limit die size, and
the fastest RAM is the one sitting on the CPU die itself.
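
The yield pressure in numbers, using the simplest Poisson yield model
and an assumed defect density (not a foundry figure):

import math

def poisson_yield(area_mm2, defects_per_cm2):
    # Y = exp(-A * D0), the classical first-order die yield model
    return math.exp(-(area_mm2 / 100.0) * defects_per_cm2)

d0 = 0.2                                   # defects/cm^2, assumed
for area in (50, 100, 200, 400, 600):      # die area in mm^2
    print(f"{area:>4} mm^2 die: yield ~{poisson_yield(area, d0) * 100:.0f}%")
# With D0 = 0.2/cm^2 you cross 80% yield a bit past 110 mm^2; hence the
# pressure to keep dies small and the fast RAM on-die.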

> if you dedicated all your pins to power and links to 4 neighbors, your
> basic board design could just tile a bunch of these.  say 8x8 chips on 
> a 1U system.

http://www.youtube.com/watch?v=JQCP85FngzE
 
> > The Blue Genes do this incredibly well, so did SiCortex, and Seamicro
> > appears to be doing this really well, too, based on all the press
> > they've been getting.
> 
> has anyone seen anything useful/concrete about the next-gen system
> interconnect fabrics everyone is working on?  latency, bandwidth,
> message-throughput, topology?

I'm really looking forward to benchmarking Epiphany's fabric,
assuming they don't fall on their faces:

http://www.adapteva.com/products/silicon-devices/e64g401/
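
By "benchmark the fabric" I mean the usual ping-pong latency and
bandwidth curve. The sketch below uses mpi4py purely to show the
shape of the test; the Epiphany itself would be driven through its
own SDK, not MPI.

# run with: mpiexec -n 2 python pingpong.py
from mpi4py import MPI
import numpy as np, time

comm, rank = MPI.COMM_WORLD, MPI.COMM_WORLD.Get_rank()
reps = 1000

for size in (8, 1024, 65536, 1 << 20):               # message size in bytes
    buf = np.zeros(size, dtype='b')
    comm.Barrier()
    t0 = time.perf_counter()
    for _ in range(reps):
        if rank == 0:
            comm.Send(buf, dest=1); comm.Recv(buf, source=1)
        elif rank == 1:
            comm.Recv(buf, source=0); comm.Send(buf, dest=0)
    if rank == 0:
        half_rtt = (time.perf_counter() - t0) / (2 * reps)
        print(f"{size:>8} B: {half_rtt * 1e6:8.2f} us, "
              f"{size / half_rtt / 1e9:6.2f} GB/s")
# Small messages expose latency, large ones bandwidth; plotting both
# against message size is the first thing to do to a new fabric.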
 
> > With the DARPA Exascale report saying we can't get
> > to Exascale with current power consumption profiles, you can bet this
> > will be a hot area of research over the next few years.
> 
> as heretical as it sounds, I have to ask: where is the need for exaflop?

Let's say we've scanned a 450 mm^3 biological system at ~8 nm
resolution, have traced out the connectome, and want to simulate
it at useful speed, say no more than 10^3 times slower than wall
clock, preferably in realtime.

You could try it like this
http://www.stanford.edu/group/brainsinsilicon/documents/Overview.pdf
but this is custom hardware, and these are toy neurons.
If you want to do it a la Blue Brain, it gets expensive, fast.
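
The numbers behind that scenario, with every count and per-event cost
below assumed purely for scale:

# Scan side: 450 mm^3 at 8 nm voxels.
volume_mm3, voxel_nm = 450, 8
voxels = volume_mm3 * 1e18 / voxel_nm**3        # 1 mm^3 = 1e18 nm^3
print(f"scan voxels: {voxels:.1e} (~{voxels / 1e18:.1f} EB at 1 byte/voxel)")

# Simulation side: event-driven point neurons vs. Blue-Brain-style
# compartmental models, with assumed counts for a volume of that size.
syn, rate, flop_ev = 1e11, 10.0, 1e2            # synapses, mean Hz, FLOP/event
point_rt = syn * rate * flop_ev                 # ~1e14 FLOP/s for realtime

neurons, comp, flop_c, steps = 1e8, 1e3, 1e2, 4e4   # 0.025 ms timestep
compart_rt = neurons * comp * flop_c * steps    # ~4e17 FLOP/s for realtime

for name, rt in (("point neurons", point_rt), ("compartmental", compart_rt)):
    print(f"{name:>14}: {rt:.0e} FLOP/s realtime, {rt / 1e3:.0e} at 1000x slower")
# The compartmental case sits within a factor of a few of an exaflop
# for realtime, which is what "expensive, fast" means here.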

> I'm a bit skeptical about the import of the extreme high end of HPC - 

4k photorealism at 60+ fps with full physics in a <1 kW footprint.
Most gamers would sell their souls for that. This isn't exascale,
but it would be served by a small slice of the same pie.
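
Roughly what that works out to (samples per pixel and FLOP per sample
are assumed):

pixels   = 3840 * 2160            # one 4k frame
fps      = 60
spp      = 16                     # path-tracing samples per pixel, assumed
flop_smp = 1e4                    # FLOP per sample (traversal + shading), assumed

flops = pixels * fps * spp * flop_smp
budget_w = 1000                   # the <1 kW footprint from above
print(f"{flops:.1e} FLOP/s -> {flops / budget_w / 1e9:.0f} GFLOPS/W needed")
# ~8e13 FLOP/s in under a kilowatt is ~80 GFLOPS/W, past the ~50 GFLOPS/W
# an exaflop-in-20-MW machine needs; same efficiency curve, much smaller
# machine.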

> or to put it another way, I think much of the real action is in jobs
> that are only a few teraflops in size.  that's O(1000) cores, but you'd
> size a cluster in the 10-100 Tf range...


