[Beowulf] Given all the recent infrastructure talk...
landman at scalableinformatics.com
Thu Oct 4 11:12:11 PDT 2012
On 10/04/2012 01:30 PM, Andrew Holway wrote:
>> bitter? sure. to me Canadian HPC is on the verge of extinction,
>> partly because of this issue.
> Is Canadian HPC a distinct entity from US HPC?
Quite distinct. US has XSEDE and a number of other national/regional
and national lab initiatives. Canada has SharcNet and other things.
> I wonder if there is a HPC 'critical mass'.
For business? Somewhat. For unis and research/edu in general? Looks
to me like lots of support.
Mark's point though was this:
> I think in some sense, the problem is that in academic HPC organizations,
> decisions are typically made by academics recruited to be management,
> and they have either a high fear/expectation of failure or a low expectation
> in being able to fix problems that do arise (or both). it's crippling,
> and being emotional, prevents such organizations from considering how to
> rationally estimate the risks, and to design the process to manage it.
> in a sense, beowulf has been corrupted by its own success.
> hacking (in the classic sense) is inherently risky
I don't know the actual state of HPC in Canada, and as Mark works in
this, I'd say his view is likely far more accurate than I could guess.
Researchers sometimes make good managers, sometimes they don't. Risk
aversion by choice of brand name is one way to avoid making careful risk
analyses, and substitute them with something of lower value, which may
not be valid ... but hey, no one was ever fired for choosing
IBM/Microsoft/... (insert large brand name here). I think the term Mark
used was "sclerosis". I believe this is an apt description and a
With respect to the "cutting out the middleman" point that Mark made, there
are costs and benefits to every decision. We've seen great designs from
good architects at various places. We've seen just awful/terrible
designs at many others.
Google designs to their needs, as does FB. They buy enough quantity
that the costs associated with their efforts are lower if they can
control the BOM going into the parts.
This isn't true of everyone.
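To put rough arithmetic behind the quantity argument, here is a minimal sketch assuming a simple model: a custom design carries a one-time engineering cost and saves a fixed amount per unit by controlling the BOM. The function names and every dollar figure are hypothetical, chosen only for illustration; they are not numbers from Google, FB, or anyone else.

```python
import math

def break_even_units(fixed_design_cost, saving_per_unit):
    """Smallest unit count N at which N * saving_per_unit covers the
    one-time design cost (hypothetical linear cost model)."""
    if saving_per_unit <= 0:
        raise ValueError("a custom design must save something per unit")
    return math.ceil(fixed_design_cost / saving_per_unit)

def net_saving(n_units, fixed_design_cost, saving_per_unit):
    """Net benefit of designing your own platform at a given volume."""
    return n_units * saving_per_unit - fixed_design_cost

# Illustrative, made-up inputs: $2M of design effort, $150 saved per server.
design_cost = 2_000_000
per_unit = 150

print(break_even_units(design_cost, per_unit))   # units needed to break even
print(net_saving(100_000, design_cost, per_unit))  # very large buyer
print(net_saving(500, design_cost, per_unit))      # typical small cluster
```

Under these assumed numbers the large buyer comes out well ahead, while a 500-node shop never recovers the design cost, which is the sense in which the approach does not carry over to everyone.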
Moreover, their failover model doesn't engineer "enterprise" features
into their systems; think of a large RAIN (Redundant Array of Inexpensive
Nodes) scenario. They are engineering for failure, at a coarser grain
(extra-unit) level, so they don't need to pay for failure avoidance at a
fine grain (intra-unit) level beyond failure detection.
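The coarse-grain versus fine-grain trade can be sketched numerically. The following is a rough illustration, assuming node failures are independent and every node has the same availability (both are simplifying assumptions, and the availability figures are made up). It computes the chance that at least k of n inexpensive nodes are up, which is what matters once the software layer handles failover.

```python
from math import comb

def service_availability(n_nodes, k_needed, node_availability):
    """Probability that at least k_needed of n_nodes are up, assuming
    independent failures and identical per-node availability."""
    p = node_availability
    return sum(
        comb(n_nodes, i) * p**i * (1 - p) ** (n_nodes - i)
        for i in range(k_needed, n_nodes + 1)
    )

# One hardened "enterprise" box, assumed 99.9% available.
single_hardened = service_availability(1, 1, 0.999)

# Twelve cheap nodes, assumed 99% available each, with any ten sufficient.
cheap_pool = service_availability(12, 10, 0.99)

print(single_hardened, cheap_pool)
```

With these assumed inputs the pool of cheaper nodes comes out more available than the single hardened unit, provided failures are detected and work is rerouted. Correlated failures (shared power, network, or software bugs) would weaken that result.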
Google and FB are, to a degree, taking Beowulf (design what you need,
engineer at the software stack to handle management and other issues) to
the next level. This isn't BYOC (build your own cluster), this is BTPYN
(build the platform you need). Paying for extra stuff they don't need
across N servers (where log10(N) >= 5, i.e. N >= 100,000) makes no sense. Paying
"middlemen" to do what they want makes no sense. Contracting with
Quanta et al to design/build to their specs makes a great deal of sense.
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman at scalableinformatics.com
web : http://scalableinformatics.com
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615