[Beowulf] ARM cpu's and development boards and research
hahn at mcmaster.ca
Tue Nov 27 23:17:37 PST 2012
> What Bill has just described is known as an "Amdahl-balanced system",
> and is the design philosophy behind the IBM Blue Genes and also
> SiCortex. In my opinion, this is the future of HPC. Use lower power,
> slower processors, and then try to improve network performance to reduce
> the cost of scaling out.
"small pieces tightly connected", maybe. these machines offer very nice
power-performance for those applications that can scale efficiently to
say, tens of thousands of cores. (one rack of BGQ is 32k cores.)
we sometimes talk about "embarrassingly parallel" - meaning a workload
with significant per-core computation requiring almost no communication.
but if you have an app that scales to 50k cores, you must have a very,
very small serial portion (Amdahl's law wise). obviously,
they put that 5d torus in a BGQ for a reason,
not just to permit fast launch of EP jobs.
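to put a rough number on "very, very small": a minimal sketch of Amdahl's law, solved for the largest serial fraction that still gives a chosen parallel efficiency. the 50k core count is from the text above; the 90% and 50% efficiency targets and the function names are assumptions picked only for illustration.

```python
def amdahl_speedup(serial_fraction, cores):
    """Amdahl's law: speedup on `cores` cores for a given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


def max_serial_fraction(cores, efficiency):
    """Largest serial fraction that still reaches `efficiency` = speedup / cores.

    From speedup/cores = 1 / (1 + s*(cores - 1)), solve for s.
    """
    return (1.0 / efficiency - 1.0) / (cores - 1)


if __name__ == "__main__":
    cores = 50_000  # core count mentioned in the text
    for efficiency in (0.9, 0.5):  # illustrative targets, not from the text
        s = max_serial_fraction(cores, efficiency)
        print(f"efficiency {efficiency:.0%}: serial fraction <= {s:.2e}, "
              f"speedup {amdahl_speedup(s, cores):.0f}")
```

under this model, even settling for 50% efficiency on 50,000 cores means the serial fraction can be no more than about 2e-5 of the single-core runtime.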
I don't think either Gb or IB are a good match for the many/little
approach being discussed. SiCortex was pretty focused on providing
an appropriate network, though the buying public didn't seem to
appreciate the nuance.
IB doesn't seem like a great match for many/little: a lot of cores
will have to share an interface to amortize the cost. do you provide
a separate intra-node fabric, or rely on cache-coherence within a node?
Gb is obviously a lot cheaper, but at least as normally operated is
a non-starter latency-wise. (and it's important to realize that latency
becomes even more important as you scale up the node count, giving each
less work to do...)
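a toy model of why latency bites harder as the node count grows: hold the total work per step fixed (strong scaling), split it evenly, and charge each rank a fixed latency per message. everything here is an assumption for illustration: the 1 s of compute per step, the one message per step, and the 50 us and 2 us latencies are ballpark placeholders, not measurements of any particular Gb or low-latency fabric.

```python
def comm_fraction(total_compute_s, ranks, latency_s, messages_per_step=1):
    """Fraction of each step spent waiting on message latency.

    Strong scaling: the fixed `total_compute_s` is divided evenly over `ranks`,
    while the per-message latency stays constant.
    """
    compute = total_compute_s / ranks
    comm = messages_per_step * latency_s
    return comm / (compute + comm)


if __name__ == "__main__":
    total_compute_s = 1.0  # assumed single-core compute time per step
    latencies = {"slow (50 us, assumed)": 50e-6, "fast (2 us, assumed)": 2e-6}
    for label, latency in latencies.items():
        for ranks in (1024, 32768):
            frac = comm_fraction(total_compute_s, ranks, latency)
            print(f"{label:>22}  ranks={ranks:>6}  latency share={frac:.1%}")
```

with these placeholder numbers the 50 us case goes from under 5% of the step at 1024 ranks to over 60% at 32768, purely because each rank has less work to hide the latency behind.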
> Essentially, you want the processors to be
> *just* fast enough to keep ahead of the networking and memory, but no
> faster to optimize energy savings.
interconnect is the sticking point.
I strongly suspect that memory is going to become a non-issue. shock!
from where I sit, memory-per-core has been fairly stable for years now
(for convenience, let's say 1GB/core), and I really think dram is going
to get stacked or package-integrated very soon. suppose your building
block is 4 fast cores, 256 "SIMT" gpu-like cores, and 4GB very wide dram?
if you dedicated all your pins to power and links to 4 neighbors, your
basic board design could just tile a bunch of these. say 8x8 chips on
a 1U system.
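the arithmetic for that hypothetical 1U board, as a small sketch. the per-chip numbers and the 8x8 layout are the ones supposed above; treating the on-board links as a plain 2D mesh with no wraparound is an extra assumption (a torus would count differently), and the names are made up for the example.

```python
FAST_CORES_PER_CHIP = 4    # from the text
SIMT_CORES_PER_CHIP = 256  # from the text
DRAM_GB_PER_CHIP = 4       # from the text
LINKS_PER_CHIP = 4         # "links to 4 neighbors"


def board_totals(rows=8, cols=8):
    """Totals for a rows x cols tile of chips on one board."""
    chips = rows * cols
    # Assumption: plain 2D mesh on the board, no wraparound links.
    on_board_links = rows * (cols - 1) + cols * (rows - 1)
    # Ports on edge chips with no on-board neighbor, free to leave the board.
    off_board_ports = chips * LINKS_PER_CHIP - 2 * on_board_links
    return {
        "chips": chips,
        "fast_cores": chips * FAST_CORES_PER_CHIP,
        "simt_cores": chips * SIMT_CORES_PER_CHIP,
        "dram_gb": chips * DRAM_GB_PER_CHIP,
        "dram_gb_per_fast_core": DRAM_GB_PER_CHIP / FAST_CORES_PER_CHIP,
        "on_board_links": on_board_links,
        "off_board_ports": off_board_ports,
    }


if __name__ == "__main__":
    for name, value in board_totals().items():
        print(f"{name:>22}: {value}")
```

that works out to 64 chips, 256 fast cores, 16384 SIMT cores and 256 GB of dram per 1U, which keeps the 1GB per (fast) core ratio mentioned above.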
> The Blue Genes do this incredibly well, so did SiCortex, and Seamicro
> appears to be doing this really well, too, based on all the press
> they've been getting.
has anyone seen anything useful/concrete about the next-gen system
interconnect fabrics everyone is working on? latency, bandwidth,
> With the DARPA Exascale report saying we can't get
> to Exascale with current power consumption profiles, you can bet this
> will be a hot area of research over the next few years.
as heretical as it sounds, I have to ask: where is the need for exaflop?
I'm a bit skeptical about the import of the extreme high end of HPC -
or to put it another way, I think much of the real action is in jobs
that are only a few teraflops in size. that's O(1000) cores, but you'd
size a cluster in the 10-100 Tf range...