[Beowulf] Utility Supercomputing...

Mark Hahn hahn at mcmaster.ca
Sat Mar 2 17:56:52 PST 2013


> had some of all of this.  The cool thing about IaaS providers is they
> can (not always) get a hold of more recent kit, and if you are smart
> about how you use it you can see huge benefits from all of this, even
> down to simple changes to CPU spec at a simple level, to much larger
> wins if you fix all of the bad/old/shabby/poor stuff.

color me skeptical.  I don't see why a computing service provider
has any greater access to new kit than the service user (except
in the small-user niche), unless there's some externality that
gives them, say, a tax advantage and allows them to depreciate HW
at a higher rate.

if Amazon installs a new Haswell/80GiB/Phi cluster just because I ask
for it, they're going to charge me the cost + operating + margin.
sure, they can pool it with other users if my duty cycle is low, but in a
macroecon sense, the only thing that happens by renting
rather than owning is that they make a profit.
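
to make that concrete, here's a back-of-the-envelope python sketch of the
rent-vs-own break-even; every number in it (server price, opex, provider
margin) is invented for illustration, not anyone's real pricing:

# rent-vs-own sketch; all inputs below are made-up illustrative numbers
server_capex = 4500.0          # assumed list price of one 2-socket server, USD
lifetime_years = 5
opex_per_year = 300.0          # assumed power/cooling/admin share per server, USD
lifetime_hours = lifetime_years * 365 * 24

own_total = server_capex + opex_per_year * lifetime_years
own_per_hour = own_total / lifetime_hours

provider_margin = 0.30         # assumed markup over the provider's own cost
rent_per_hour = own_per_hour * (1.0 + provider_margin)

for duty_cycle in (0.1, 0.3, 0.5, 0.8, 1.0):
    rent_total = rent_per_hour * lifetime_hours * duty_cycle  # pay only for used hours
    winner = "rent" if rent_total < own_total else "own"
    print(f"duty cycle {duty_cycle:4.0%}: rent ${rent_total:7,.0f} "
          f"vs own ${own_total:7,.0f} -> {winner}")

with those made-up numbers the crossover sits around a 77% duty cycle
(1/(1+margin)); the margin only vanishes into pooling gains if your own
box would have sat mostly idle.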

> PUE is starting to be a big issue.  We built the MGHPCC as both a
> responsible response to sustainable energy costs (hydro), and the fact
> that on campus machine rooms were heading towards the 3.0 PUE scale,
> which is just plain irresponsible.

forget the moral outrage: it's just incompetent.  AFAICT, institutional IT
people cultivate an inscrutable BOFHishness (in the guise of security or
reliability or some other laudable metric.)  I always say that HPC is an
entirely different culture than institutional IT.  that's certainly the case
on my campus, where the central IT organization is all excited about
embarking on a "5 year" >$50M ERP project that will, of course,
_do_everything_, and do it integrated.  they have the waterfall diagrams to
prove it.
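
to put a rough number on why a 3.0 PUE deserves "incompetent": PUE is total
facility power over IT power, so here's a quick sketch (the machine-room load
and the electricity price are invented for illustration):

# what PUE 3.0 vs ~1.2 means in electricity dollars; inputs are invented
it_load_kw = 500.0             # assumed IT load of a campus machine room
usd_per_kwh = 0.10             # assumed electricity price
hours_per_year = 365 * 24

for pue in (3.0, 1.5, 1.2):
    facility_kw = it_load_kw * pue        # PUE = total facility power / IT power
    annual_cost = facility_kw * usd_per_kwh * hours_per_year
    print(f"PUE {pue}: {facility_kw:5.0f} kW total, ${annual_cost:10,.0f}/year")

with those invented numbers, the 3.0 room burns roughly $800k/year more than
a 1.2 room to deliver the same computing.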

I consider the web2.0 world to be part of the HPC culture, since companies
like Google/FB/etc are set up to avoid BOFH-think and readily embrace
traditional research/hacker virtues like modularity/iteration/experimentation.

> software.  Smart software costs a whole lot less than fixed
> infrastructure.

that seems to be the central claim, for which I see no justification.
any HPC organization does devops, whether they call it that or not
(just as HPC is normally PaaS, though not often by that name.)

why do you think "smart" software somehow saves money in such a way
that a large, compute-intensive organization couldn't do the same
but by buying rather than renting?

> Re: "special sauce" - there is a fair amount of it at scale. I've seen
> inside the source for what Cycle does, and have also run 25K+ proc
> environments, we all need some sauce to make any of this large scale
> stuff even slightly palatable!

color me skeptical again.  admittedly my organization's largest cluster
is ~8500 cores, but such clusters are pretty trivial, devops-wise.
I doubt that some crazy nonlinearity kicks in between 8k and 25k.

>> the one place where cloud/utility outsourcing makes the most sense is at
>> small scale.  if you don't have enough work to keep tens of racks busy,
>> then there are some scaling and granularity effects.  you probably can't
>> hire 3% of a sysadmin, and some of your nodes will be idle at times...
>
> Agree - I've seen significant win at both the small end and the large
> end of things.  Utility being a phrase that identifies how one can
> turn on and off resource at the drop of a hat.  If you are running

but that's totally ho-hum.  it takes less than one hat-fall-time to 
ssh to our clusters and submit a job.  if we provisioned VMs, rather
than running resource-dedicated jobs on shared machines (IaaS vs PaaS),
there would be some modestly larger latency for starting the job.
(I guess 10s to boot a VM rather than 10ms to start a job.)

> nodes 100% 365d/y on prem is still a serious win.  If you have the odd

afaict, you're merely saying "users with sparse duty cycle benefit from
pooling load with other, similar users".  that's true, but doesn't help
me understand why pooling should be able to justify both Amazon's margin
and *yours*.  sorry if that's too pointed.

I'm guessing the answer is that your market is mainly people who are too
self-important (anything health- or finance-related) to do it themselves
(or join a coop/consortium), or else people just getting their toes wet.

> my, if I'd have used that for a whole year we would have gone broke,
> but it is totally awesome to use for a few hours, and then go park it

again, you're merely making the pitch for low-duty-cycle-pooling.
to me, this doesn't really explain how multiple layers of profit margin
will be supported...

>> I'm a little surprised there aren't more cloud cooperatives, where smaller
>> companies pool their resources to form a non-profit entity to get past these
>> dis-economies of very small scale.  fundamentally, I think it's just that
>> almost anyone thrust into a management position is phobic about risk.
>> I certainly see that in the organization where I work (essentially an
>> academic HPC coop.)
>
> You guys proved this out:
>
> http://webdocs.cs.ualberta.ca/~jonathan/PREVIOUS/Grad/Papers/ciss.pdf

I've still got the t-shirt from CISS.  we went along with it primarily
for PR reasons, since the project's axe-to-grind was pushing the
cycle-stealing grid concept.  you'll notice hardly anyone talks about 
grid anymore, although it's really a Something-aaS (usually P in those days,
since hardware VMs weren't around much.)

Grid people seemed to love tilting at the geographic/latency windmill,
which I think we all know now was, ah, quixotic.

>> people really like EC2.  that's great.  but they shouldn't be deluded into
>> thinking it's efficient: Amazon is making a KILLING on EC2.
>
> It's as efficient as "the bench" shows you, no more no less.

you're talking about a single purchasing choice; I'm trying to understand
the market/macro.  if you're suggesting that running on a VMed cluster is 
more efficient than a bare-metal cluster (with comparably competent ops),
well, let me just call that paradoxical.

>>> systems than some of us have internally, are we starting to see
>>> overhead issues vanish due to massive scale, certainly at cost?  I know
>>
>>
>> eh?  numbers please.  I see significant overheads only on quite small
>> systems.
>
> You are right numbers were missing.  Here's an example of a recent Cycle run:
>
> http://blog.cyclecomputing.com/2013/02/built-to-scale-10600-instance-cyclecloud-cluster-39-core-years-of-science-4362.html

Amazon's spot price for m1.xlarge instances is indeed surprisingly low.
I guess my main question is: once you succeed in driving more big pharma
work into the EC2 spot market, wouldn't you expect the spot prices to 
approach on-demand prices?

m1.xlarge instances sound like they are between half and a quarter of a
modern 2s 16c E5 server, which costs $4-5k.  that makes your sqft and price
numbers off by a factor of two (so perhaps you credit the instances
with more speed - I've never tried to measure them.)

> So ignoring any obvious self promotion here, 40yrs of compute for
> $4.3K is a pretty awesome number, least to me.  Probably compares with

your spot instances were around $0.052/hour, and if you had bought
servers at list price, you'd pay about $0.046/hour (that's assuming you run
them for 5 years, but *not* counting operating costs.)
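
for anyone who wants to poke at that $0.046, here's the arithmetic; the
server price and the "m1.xlarge ~= half a 2s E5 box" mapping are my guesses:

# reproduce the ~$0.046/instance-hour owned figure; inputs are my assumptions
server_list_price = 4000.0      # low end of the $4-5k 2s 16c E5 range
instances_per_server = 2        # treating an m1.xlarge as ~half of such a server
lifetime_hours = 5 * 365 * 24   # 5-year life, operating costs *not* included

owned = (server_list_price / instances_per_server) / lifetime_hours
spot = 0.052                    # spot price quoted above for m1.xlarge
print(f"owned: ${owned:.3f}/instance-hour")   # ~0.046
print(f"spot:  ${spot:.3f}/instance-hour")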

> some academic rates, at least it is getting close, comparing a single
> server:
>
> http://www.rhpcs.mcmaster.ca/current-rates

heh, sure.  I'm not really talking about small-scale hosting, though.

> I do have to say  - I absolutely LOVE that you guys put in real FTE
> support for these servers - this is a very cool idea, I never did
> implement charge back in my old gig, what you guys are doing here is
> awesome!

actually, RHPCS is providing something more like traditional IT support
there - my org, Sharcnet, does unlimited free research PaaS HPC
(as does the rest of ComputeCanada).  ComputeCanada divisions have made some
attempts at cost recovery in the past 10ish years, and pretty much
failed each time...

> My point was more around the crazy "elastic scale" one can now pull
> off with the appropriate software and engineering available from top
> tier IaaS and PaaS providers.

puzzled.  is it any different from logging into a large cluster and
submitting a 40k cpu job?  (of course, in the example you gave, it 
was really 10k 4-cpu jobs, which is certainly a lot less crazy...)

regards, mark hahn.


