[Beowulf] Given all the recent infrastructure talk...

Thu Oct 4 11:12:11 PDT 2012

On 10/04/2012 01:30 PM, Andrew Holway wrote:
>> bitter?  sure.  to me Canadian HPC is on the verge of extinction,
>> partly because of this issue.
>
> Is Canadien HPC a distinct entity from US HPC?

Quite distinct.  The US has XSEDE plus a number of other national,
regional, and national-lab initiatives.  Canada has SharcNet and other
things (ComputeCanada).

[...]

> I wonder if there is a HPC 'critical mass'.

For business?  Somewhat.  For unis and research/edu in general?  Looks
to me like lots of support.

Mark's point though was this:

> I think in some sense, the problem is that in academic HPC organizations,
> decisions are typically made by academics recruited to be management,
> and they have either a high fear/expectation of failure or a low expectation
> in being able to fix problems that do arise (or both).  it's crippling,
> and being emotional, prevents such organizations from considering how to
> rationally estimate the risks, and to design the process to manage it.
>
> in a sense, beowulf has been corrupted by its own success.
> hacking (in the classic sense) is inherently risky

I don't know the actual state of HPC in Canada, and as Mark works in
this area, I'd say his view is likely far more accurate than any guess
of mine.

Researchers sometimes make good managers, sometimes they don't.  Choosing
a brand name out of risk aversion is one way to avoid a careful risk
analysis, substituting something of lower value that may not even be
valid ... but hey, no one was ever fired for choosing
IBM/Microsoft/... (insert large brand name here).  I think the term Mark
used was "sclerosis".  I believe this is an apt description.

With respect to the "cutting out the middleman" point that Mark made,
there are costs and benefits to every decision.  We've seen great designs
from good architects at various places.  We've seen just awful designs
at many others.

Google designs to their needs, as does FB.  They buy in enough quantity
that the cost of that design effort is more than repaid if they can
control the BOM going into the parts.

This isn't true of everyone.

Moreover, their failover model doesn't engineer "enterprise" features
into their systems; think of a large RAIN (Redundant Array of Inexpensive
Nodes) scenario.  They are engineering for failure at a coarser-grain
(extra-unit) level, so they don't need to pay for failure avoidance at a
fine-grain (intra-unit) level beyond failure detection.
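
As a rough illustration of the coarse-grain argument, here is a minimal
Python sketch.  It assumes nodes fail independently and that the service
only needs some number of them up at once; the function name and the 0.99
per-node figure are made up for illustration and are not numbers from
Google or FB.

```python
from math import comb

def service_availability(total: int, needed: int, node_up: float) -> float:
    """Probability that at least `needed` of `total` independent nodes are up."""
    return sum(
        comb(total, k) * node_up**k * (1.0 - node_up)**(total - k)
        for k in range(needed, total + 1)
    )

# Hypothetical numbers: each cheap node is up 99% of the time.
single = service_availability(1, 1, 0.99)    # one node, no spares
spared = service_availability(12, 10, 0.99)  # service needs 10 of 12 nodes

print(f"single node: {single:.6f}")
print(f"10 of 12   : {spared:.6f}")
```

Under these assumptions the 10-of-12 case comes out more available than
any single node, which is the sense in which spare units can stand in for
per-unit failure avoidance.  Correlated failures (shared power, shared
switch) would weaken the result.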

Google and FB are, to a degree, taking Beowulf (design what you need,
engineer at the software stack to handle management and other issues) to
the next level.  This isn't BYOC (build your own cluster), this is BTPYN
(build the platform you need).  Paying for extra stuff they don't need
across N servers (where log10(N) >= 5, i.e. N >= 100,000) makes no
sense.  Paying "middlemen" to do what they want makes no sense.
Contracting with Quanta et al to design/build to their specs makes a
great deal of sense.
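
To make the volume argument concrete, here is a small Python sketch of the
break-even point.  The dollar figures and the function name are
hypothetical, chosen only to show the shape of the arithmetic; they are
not anyone's real costs.

```python
from math import ceil, log10

def break_even_units(design_cost: float, saving_per_unit: float) -> int:
    """Smallest unit count at which per-unit savings repay the fixed design cost."""
    if saving_per_unit <= 0:
        raise ValueError("saving_per_unit must be positive")
    return ceil(design_cost / saving_per_unit)

# Hypothetical figures, for illustration only.
DESIGN_COST = 2_000_000.0  # one-time engineering effort for a custom design
SAVING_PER_UNIT = 150.0    # unneeded parts removed from each server's BOM

n_min = break_even_units(DESIGN_COST, SAVING_PER_UNIT)

for n in (1_000, 100_000):
    net = n * SAVING_PER_UNIT - DESIGN_COST
    print(f"N={n:>7}  log10(N)={log10(n):.0f}  net saving={net:>14,.0f}")
print("break-even N:", n_min)
```

With these made-up inputs a 1,000-node buyer loses money on a custom
design while a 100,000-node buyer comes out well ahead, which is why the
same decision does not carry over to everyone.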

Regards,

Joe

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615