[Beowulf] Definition of HPC

Sat Apr 20 13:48:49 PDT 2013

On 4/19/13 12:31 PM, "Adam DeConinck" <ajdecon at ajdecon.org> wrote:

>-----BEGIN PGP SIGNED MESSAGE-----
>Hash: SHA1
>
>On Fri, Apr 19, 2013 at 05:10:37PM +0100, Tim Cutts wrote:
>> Anyone running a research computing setup has encountered both of these
>>issues.  Virtualisation mitigates the damage that can be done, without
>>the expense of a separate toy cluster, but it doesn't address these
>>support and transition to production issues.
>
>Yeah, these are the problems you're going to hit with either toy
>clusters or VMs. The main reason I like VMs for this is that it
>eliminates a whole class of "futzing around with hardware" problems:
>I might have to help dig people out of their holes occasionally, but
>at least I don't need to go down to their labs and untangle a rats'
>nest of cables.
>
>I'm lucky enough to have a smallish and relatively good-humored user
>community, so when someone really craters their test system, they
>usually have a six-pack of my favorite beer on hand when they ask for
>help. ;-) But that doesn't really scale...

Aha. But beer *making* does scale. In fact, small quantities are harder
to make than larger ones, so it has superlinear scaling.

But I guess what you're really getting at is that if your cluster were,
say, 10 times bigger, with 10 times as many users, your effectiveness in
providing support would be adversely affected by the 10 times larger beer
consumption.

Joking aside, I think that's one of the key differences as system size
(and cost) scales up.  Not only do million-dollar/euro clusters attract
more accounting interest, they are also harder to administer in a casual
way.

If you look back 20 years, there were great challenges in adapting to the
Beowulf cluster model with nodes interconnected by a "relatively slow"
interconnect (compared, say, to multiport memory), but by now, there's an
enormous amount of software which has either been modified or designed
from scratch to work well in a cluster architecture.  And a lot of it
scales really well from 10 to 100 to 1000 nodes.  Really smart people (who
are on this list, of course) have spent significant effort and succeeded
at this.

But there are aspects of scalability totally unrelated to the
processing/memory/disk/interconnect architecture.  Physical plant (look at
the discussions on HVAC, liquid cooling, etc.).  Sys Admin (root access,
sharing, batch queues, partitioning).  Financial and Business Admin
(chargebacks, accounting for time, amortization).  We had discussed some
of this 10 years or so ago in the context of "how big a cluster can one
person manage" and issues of the granularity of staff.  If you have
someone with the right skills who's not fully occupied, then the
incremental admin cost to add nodes is small.
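
A minimal sketch of the granularity argument, assuming staff can only be
added in whole-person units and that one admin can handle some fixed
number of nodes.  The function names, the 500-node capacity, and the unit
cost are hypothetical, chosen only for illustration.

```python
import math


def admins_needed(nodes, nodes_per_admin):
    """Whole admins required; staff come in units of one person."""
    return math.ceil(nodes / nodes_per_admin)


def incremental_admin_cost(nodes, extra_nodes, nodes_per_admin, cost_per_admin):
    """Extra staff cost incurred by growing the cluster by extra_nodes."""
    before = admins_needed(nodes, nodes_per_admin)
    after = admins_needed(nodes + extra_nodes, nodes_per_admin)
    return (after - before) * cost_per_admin


if __name__ == "__main__":
    # Hypothetical figures: one admin per 500 nodes, cost of 1.0 per admin.
    # Growing 100 -> 200 nodes stays within one person's spare capacity.
    print(incremental_admin_cost(100, 100, 500, 1.0))
    # Growing 450 -> 550 nodes crosses the threshold and needs a second hire.
    print(incremental_admin_cost(450, 100, 500, 1.0))
```

Under these assumptions the cost is a step function: adding nodes is free
until the existing person is saturated, and then it jumps by a whole
salary.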

In some ways, I think those problems are actually harder, because they
have constraints and evaluation functions that are a lot less tangible,
less tractable, and less measurable.  How do you deal with the "when
something is expensive, more levels of management get involved" aspect?
That's getting into sociology more than engineering.