[Beowulf] Selling computation time

Thu Dec 28 12:15:54 PST 2006

Robert G. Brown wrote:

> The other is numerical HPC applications.  Here the marketplace is one
> where it is difficult to achieve a win.  First of all, most people who
> are doing HPC have very specific, very diverse, applications and often
> these applications run on clusters that are at least to some extent
> custom-engineered for the application.

Hmmm... I have a recursive joke I like to use in cases like this ...

Gross generalizations tend to be incorrect.

There are many clusters whose end users just want the machine to run an 
app or three.  These apps are "fixed" or very slowly changing, and the 
requirements are well defined and static.

I think it might be a better approximation to say that clusters to a 
large extent mirror their intended usage pattern.  The people who will 
run commercial or even non-commercial codes and do little to no 
modification, could be well served by a "fixed" resource.

The folks working on more "researchy" type things, where they do code 
development as well as use the machines for simulation/analysis, are 
more likely to want finer grained control.

> A general purpose commercial cluster would face immediate problems
> providing a "grid-like" interface to the general population.  Even

See above:  For the group for which this is a fixed immutable resource, 
this is not generally correct.  "Grid-like" or web access to cluster 
resources works, and quite well in fact.  Bug me off line if you want to 
know more, so I can avoid spamming the list.

> something as simple as compiling for the target platform would become
> difficult, and is one of those areas where solutions on real grid
> computers tend to be at least somewhat ugly. 

For those doing development and more research-like things not using 
fixed resources (programs, ...), yes, this is an issue.

> Then there is access,
> accounting, security, storage, whether or not the applications is EP or
> has actual IPCs so that it needs allocations of blocks of nodes with
> some given communications stack and physical network.

Again, lots of this stuff is handled quite well for programs that change 
infrequently.

> By the time one works out the economics, it tends to be a lose-lose
> proposition. 

I disagree, specifically for the groups that need the fixed resources.

> It is just plain difficult to offer computational
> resources to your potential marketplace:
> 
>   a) in a way that they can afford -- more specifically can write into a
> grant proposal to afford.

Cycles for sale would not fit into a grant model, which wants to 
minimize the variable and fixed costs.  That is, this would not work 
well for a researcher in most cases.

> 
>   b) in a way that is cheaper than they could obtain it by e.g. spending
> the same budget on hardware they themselves own and operate.  Clusters
> are really amazingly cheap, after all -- as little as a few $100 per
> node, almost certainly less than $1000 per CPU core even on bleeding
> edge hardware.  Yes, there are administrative costs and so on, but for
> many projects those costs can be filled out of opportunity cost labor
> you're paying for anyway.

Hmmm.... I would caution anyone reading (or writing.... cough cough) to 
prefix this with "for a specific configuration, not necessarily detailed 
here".  Lest end users expect Infiniband connected dual/quad core 
machines with 8 GB ram per core for $100/core (or even $1000/core).

It is hard to get a real "rule of thumb" for configurations, as a) 
everyone has a different "base" configuration, b) prices fluctuate, 
sometimes wildly, c) configuration drastically impacts cost.  We 
strongly advise people considering such machines to survey the market 
*before* taking action.  You may be sorely disappointed by what 
$1000/socket or $1000/core can buy you if your expectations are set way off.

Self education is always best.  Ask questions, see what people are 
paying.  Don't take much more than the order of magnitude as being 
correct in rules of thumb.

>   c) and, if you manage a rate that satisfies b), that still makes YOU
> money.  Dedicated cluster admins are expensive -- suppose you have just
> one of these (yourself) and are willing to do the entire entrepreneurial
> thing for a mere $60K/year salary and benefits (which I'd argue is
> starvation wages for this kind of work).  A 100-node (CPU) pro-grade
> cluster will likely cost you at LEAST $50,000 up front, plus the rent of
> a physical space with adequate AC and power resources plus roughly
> $100/node/year -- call it between $10 and $20K/year just to keep the
> nodes powered up, plus another $5000 or so in spare parts and
> maintenance expenses.  The amortization of the $50K up front investment
> is over at most three years (at which point your nodes will be too slow
> to be worth renting anyway, and you'll likely have to drop rental rates
> yearly to keep them in a worthwhile zone as it is, so call it $20K of
> depreciation and interest on the borrowed money per year plus $20K in
> operating expenses per year plus $60K for your salary -- you have to
> make about $100K/year, absolute minimum, just to break barely arguably
> not quite even.

[...]
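
To make the quoted arithmetic concrete, here is a rough sketch using only Robert's round numbers ($20K depreciation and interest, $20K operating expenses, $60K salary, 100 nodes).  The function name and the `sold_fraction` parameter are my own assumptions for illustration; billing every available node-hour is an optimistic upper bound.

```python
HOURS_PER_YEAR = 24 * 365

# Robert's round figures from the quoted text, in dollars per year.
ANNUAL_COSTS = {
    "depreciation_and_interest": 20_000,
    "operating_expenses": 20_000,
    "salary": 60_000,
}
NODES = 100


def break_even_rate(annual_costs, nodes, sold_fraction):
    """Price per node-hour needed to cover the annual costs.

    sold_fraction is the share of available node-hours actually billed
    (a hypothetical knob, not a figure from the thread).
    """
    if not 0.0 < sold_fraction <= 1.0:
        raise ValueError("sold_fraction must be in (0, 1]")
    billable_hours = nodes * HOURS_PER_YEAR * sold_fraction
    return sum(annual_costs.values()) / billable_hours


for fraction in (1.0, 0.5, 0.25):
    rate = break_even_rate(ANNUAL_COSTS, NODES, fraction)
    print(f"{fraction:>4.0%} sold: ${rate:.3f} per node-hour")
```

Halving the fraction of node-hours you actually bill doubles the rate you have to charge just to break even.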

> 
> The point being that with VERY FEW EXCEPTIONS the economics just doesn't
> work out.  Yes, some things can be scaled up or down to improve the

I disagree with the "VERY FEW EXCEPTIONS" portion.  Actually it works 
out very nicely in specific cases.  Maybe this is "FEW".  I dunno.

It works where the cost of the hardware is relatively high and the 
frequency of use is not.  Say you need to do an occasional run on a 128 
node machine to validate some of your tests, maybe once a quarter.  This 
run will take a day on that machine.  You cannot possibly run this on 
your existing hardware; the job is way too large.  Yet for under $5k per 
quarter, you can do this run.  For under $20k/year you can do all the 
runs that you occasionally need to do.
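
For anyone who wants to poke at the numbers, here is a small sketch of that example.  It uses only the figures above (128 nodes, a one day run, four runs a year, under $5k per run).  The per-node-hour rate is derived from those figures as a ceiling, not a quoted price, and the function names are made up for illustration.

```python
# Figures from the example above.
NODES = 128
RUN_DAYS = 1
RUNS_PER_YEAR = 4
MAX_COST_PER_RUN = 5_000.0  # "under $5k per quarter", so this is a ceiling


def implied_rate_per_node_hour(cost_per_run, nodes, run_days):
    """Ceiling on the price per node-hour implied by the cost of one run."""
    return cost_per_run / (nodes * run_days * 24)


def utilization_if_owned(runs_per_year, run_days, days_per_year=365):
    """Fraction of the year the machine would be busy if you bought it
    and used it only for these runs."""
    return runs_per_year * run_days / days_per_year


rate = implied_rate_per_node_hour(MAX_COST_PER_RUN, NODES, RUN_DAYS)
annual_cost = MAX_COST_PER_RUN * RUNS_PER_YEAR
utilization = utilization_if_owned(RUNS_PER_YEAR, RUN_DAYS)

print(f"ceiling rate: ${rate:.2f} per node-hour")
print(f"annual rental ceiling: ${annual_cost:,.0f}")
print(f"utilization if you owned the machine: {utilization:.1%}")
```

The last number is the interesting one: a machine bought just for these runs would sit idle for nearly the whole year.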

Put another way.  Take your machine's total usage per year (in 
CPU-hours), divide by the number of CPUs in it, and you get your average 
usage per CPU per year.  As a fraction of the hours in a year, that is 
your utilization.  Now take your depreciation cost per year, divide that 
by the number of CPUs, and you get your average depreciation cost per 
CPU.  You can even multiply that by some appropriate factor to handle 
overhead (power, cooling, Mark Hahn clones, ...).  If you have "high" 
utilization (over 70%) over the 
entire year, then you are "wasting" very little of that depreciation. 
Call the excess = 100% - utilization, and multiply that by your 
depreciation cost per year per CPU.  This is what you are effectively 
"throwing away" per CPU per year.  If you have many nodes and low 
utilization, you aren't effectively utilizing your resource and likely 
could have been using a smaller resource (i.e. a less costly one) more 
effectively.
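
The bookkeeping above fits in a few lines.  This is only a sketch: the function name and the `overhead_factor` parameter are mine, and the sample inputs ($20K/year depreciation over 100 CPUs) are borrowed from Robert's round numbers purely for illustration.

```python
def wasted_depreciation_per_cpu(depreciation_per_year, n_cpus,
                                utilization, overhead_factor=1.0):
    """Depreciation effectively "thrown away" per CPU per year.

    utilization is a fraction between 0 and 1.  overhead_factor scales
    the per-CPU cost for power, cooling, admin time and so on; 1.0 means
    no overhead is counted (an illustrative default, not a recommendation).
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    cost_per_cpu = depreciation_per_year / n_cpus * overhead_factor
    excess = 1.0 - utilization
    return excess * cost_per_cpu


# Illustrative inputs only: $20K/year depreciation spread over 100 CPUs.
for utilization in (0.7, 0.2):
    waste = wasted_depreciation_per_cpu(20_000, 100, utilization)
    print(f"utilization {utilization:.0%}: ${waste:.0f} wasted per CPU per year")
```

With the same hardware, dropping from 70% to 20% utilization raises the wasted share of each CPU's depreciation from 30% to 80%.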

Note that this scales both up and down; I picked 128 nodes as an example. 
There are good models to suggest a lower bound size on this, and 
there are good models to show where it makes sense for bulk cycle delivery.

Simply put, when you work out a reasonable costing model, and work out 
the business plan in detail, this model can work (and does work) for 
specific types of usage.

> basic picture I present, but only at a tremendous risk for a general
> purpose business.  The only exceptions I know of personally are where
> somebody is already operating a cluster consultation service --
> something like Scalable Informatics -- that helps customers design and
> build clusters for specific purposes.  In some cases those customers

Heh... We do far more than that these days.  I'll avoid a commercial ... 
Bug me off line if you want it.

> have "no" existing expertise or infrastructure for the cluster they need
> and can obtain a task-specific win by effectively subcontracting the
> cluster's purchase AND housing AND operation to SI, where SI already
> has space and resources and administration and installation support set
> up and can install and operate their client's cluster with absolutely
> minimal investment in node-scaled time.

Actually, this model is the important one.  Lower the barriers to usage, 
so that they can call up and get started right away.  There's lots more 
to it; building a viable and sustainable business model for this is *not 
easy* by any stretch of the imagination.  What soured most people (and 
investors) on this in the late 90's and early 00's was the promise that 
end users would run to this model going forward due to the huge cost of 
running their own stuff.

Turns out that the cost of running their own stuff was small compared to 
the cost of the aging of the hardware.  As Robert pointed out, as the 
hardware ages, end users get less value.  Call this the Moore's law 
based value drain.  In 2 years' time, it is 1/2 the speed of the new 
stuff.  Unless your depreciation model takes this into account, and 
front-loads these costs into your pricing, you are going to lose, as 
customers expect (and in one case demanded of us) the 
top-o-the-line stuff up, always.  Of course, this is just 
like the new-car effect.  As soon as you drive your new car off the 
dealer lot, it loses value.  As soon as your machines start to age, they 
lose value.  And few customers want to pay a premium to work on such 
machines, or on any machines for that matter.
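
Here is one way to sketch the value drain and a front-loaded schedule.  The halving every 2 years comes from the paragraph above; the $50K purchase and 3 year lifetime are Robert's round numbers.  Weighting each year by the relative speed at the start of that year is just an illustrative choice on my part, not a description of anyone's actual pricing model.

```python
def relative_speed(age_years, halving_years=2.0):
    """Speed of aged hardware relative to new hardware, assuming the
    relative speed halves every halving_years."""
    return 0.5 ** (age_years / halving_years)


def front_loaded_schedule(purchase_cost, lifetime_years, halving_years=2.0):
    """Split purchase_cost across the years of service in proportion to
    the relative speed at the start of each year, so that early years
    carry more of the cost than late ones."""
    weights = [relative_speed(year, halving_years)
               for year in range(lifetime_years)]
    total = sum(weights)
    return [purchase_cost * w / total for w in weights]


schedule = front_loaded_schedule(50_000, 3)
flat = 50_000 / 3

for year, amount in enumerate(schedule, start=1):
    print(f"year {year}: recover ${amount:,.0f} (flat would be ${flat:,.0f})")
```

Compared to a flat three-way split, this recovers more of the purchase price in year one, while the machines are still worth renting.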

What they want is to buy cycles in bulk, either delivered from local 
machines, or remote machines (most prefer local).  A cluster is a bulk 
supply of processing cycles.  You want the most cost effective cycles 
that maximize your productivity.  In the majority of cases, these cycles 
are best procured locally.  Get the cycles in house, and use them.  In 
some cases, the costs to host locally would be far higher than remote 
hosting, and the cycles aren't needed that often.  These are the cases 
where you need this sort of service.  The problem is the pricing model. 
Cycles are not premium services, they are bulk supply.  The more you 
can supply the better.  The more that are cost effective to supply the 
better.

Where I am going with this is that the vast majority of the ASP business 
models of old were garbage (which fit the times).  Saner business 
models have now emerged to a degree, but there are no willing investors 
in such things.  Hence the cycles that are delivered tend to be "extra".

> 
> Note that doing THIS provides you with all sorts of things that alter
> the basic equation portrayed above -- you already have an income from
> the consultative side and don't have to "live" on what you make running
> clusters, you don't actually buy the cluster and rent it out, the client
> buys the cluster and pays you to house and run it (so you don't have to
> deal with depreciation or rollover renewal, they do), you have
> preexisting but scalable infrastructure support for cluster installation
> and software maintenance, etc.  Even here I imagine that the margins are
> dicey and somewhat high risk, but maybe Joe will comment.  Maybe not --
> I doubt that he'd welcome more competition, since there is probably
> "just enough" business for those fulfilling the need already.  I don't
> view this as a high-growth industry...;-)

Nah, nothing to see here, move along, move along ....

The important thing in any viable business is adaptation: the ability to 
adjust the business model to reflect how customers want to work, in such 
a way that it is a win for everyone.  We have adapted and changed over 
time, and continue to do so.  This is in part because our customers' 
needs have changed, and so have the problems they consider to be major ones.

HPC is a great market.  No significant investment capital interest to 
speak of (this is a downside), but good growth year over year 
(10-20+%), a large size ($10B), and Microsoft just joined the market. 
Maybe, someday, the downside will be fixed.  Competition tends to follow 
from investment.

Joe

> 
>    rgb
> 

-- 

Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452 or +1 866 888 3112
cell : +1 734 612 4615