hahn at mcmaster.ca
Mon Dec 3 14:26:55 PST 2012
> I understand that true exascale would mean a
> more tightly knit system functioning as a single unit,
well, you can only claim a system is exascale if it can sustain
a single exa-level computation. traditionally, that application
is HPL, which is convenient because it's obviously a single
calculation, and yet compute-intensive enough (vs communication)
that it will normally approach peak FP performance. (depending
significantly on interconnect: Gb will approach a lower fraction
of peak than IB, for instance).
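a back-of-the-envelope sketch of that point (the Rmax/Rpeak numbers below are made up for illustration, not measurements of any real system):

```python
# Hypothetical illustration: HPL sustained performance (Rmax) as a
# fraction of theoretical peak (Rpeak) on two interconnects.
# All numbers are invented for the sake of the example.

def peak_fraction(rmax_tflops, rpeak_tflops):
    """Fraction of theoretical peak actually sustained by HPL."""
    return rmax_tflops / rpeak_tflops

# a well-tuned InfiniBand cluster might sustain most of its peak...
ib_fraction = peak_fraction(rmax_tflops=850.0, rpeak_tflops=1000.0)
# ...while gigabit ethernet leaves far more on the table
gbe_fraction = peak_fraction(rmax_tflops=500.0, rpeak_tflops=1000.0)

print(f"IB:  {ib_fraction:.0%} of peak")
print(f"GbE: {gbe_fraction:.0%} of peak")
```

same calculation, same FLOP count, but the slower interconnect drags the sustained fraction down.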
> but a distributed workload is in fact distributed
> no matter what the topologies or latencies end up being.
> And, isn't everyone saying exascale will take different thinking anyway?
I don't know about that. what the people who know something about it are
saying is that although we could scale up current hardware to hit exa,
the power would become a problem.
in other words, the challenge is flops/watt. communication power
(cpu-memory, let alone inter-node) is a big problem in this domain.
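to put rough numbers on it (efficiency figures are ballpark assumptions, not quotes from any roadmap):

```python
# Back-of-the-envelope power math for an exaflop machine,
# showing why flops/watt is the binding constraint.

EXAFLOP = 1e18  # FLOPS

def power_megawatts(flops, gflops_per_watt):
    """Power draw in MW to sustain `flops` at a given efficiency."""
    watts = flops / (gflops_per_watt * 1e9)
    return watts / 1e6

# an efficient ~2012-era system at roughly 2 GF/W:
print(power_megawatts(EXAFLOP, 2.0))   # 500 MW -- a power plant
# a ~20 MW facility budget implies ~50 GF/W:
print(power_megawatts(EXAFLOP, 50.0))  # 20 MW
```

scaling today's hardware up to exa gives you the first number, not the second; that gap is the whole problem.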
> Taking that view, one might argue that a network of clusters
> like google's or amazon's is already doing exascale level work
> of a very distributed workload.
nah. you might as well just count all the computers in use everywhere
as a single computation. bingo, we're already exascale!
> I wonder if something couldn't be learned from their
> small code and message passing algorithms.
nah. botnets are not very impressive on technical grounds.
(perhaps more impressive on social engineering grounds.)
> I have also often wondered about the feasibility
> of running something like a BOINC-distributed project locally
> across all available personal machines in an organization
> to accomplish large calculations that were perhaps
search-like applications work well in this context, since they're
latency-, failure- and corruption-tolerant. not everything is
search or prime-cracking...
in short, HPC has been using distributed computation for a long time.
the issue is that if you distribute widely, you start running into
problems of latency and reliability. lots of workloads will scale
very poorly unless you provide an interconnect which is low-latency,
high-bandwidth and/or high message-rate.
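a toy strong-scaling model (my own sketch, not anything from a real application) shows the effect: per-node compute shrinks as you add nodes, but a fixed per-step communication latency does not, so it eventually dominates and speedup flattens.

```python
# Toy strong-scaling model: each timestep costs t_compute/p per node
# plus a fixed communication latency that does not shrink with p.

def speedup(p, t_compute=1.0, t_latency_per_step=1e-4, steps=1000):
    """Speedup of p nodes over 1 node under this simple cost model."""
    serial = t_compute * steps
    parallel = (t_compute / p + t_latency_per_step) * steps
    return serial / parallel

for p in (1, 10, 100, 1000, 10000):
    print(p, round(speedup(p), 1))
```

with these (assumed) parameters the scaling is near-ideal at 10 and 100 nodes, then falls badly behind by 10000 nodes; shrinking the latency term is exactly what a good interconnect buys you.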