[Beowulf] Station wagon full of tapes

Tue May 26 11:10:04 PDT 2009

On May 26, 2009, at 10:20 AM, "Robert G. Brown" <rgb at phy.duke.edu> wrote:
> Subject: Re: [Beowulf] Station wagon full of tapes
> To: Jeff Layton <laytonjb at att.net>
> Cc: Beowulf Mailing List <beowulf at beowulf.org>
> Message-ID: <alpine.LFD.2.00.0905260946000.4021 at localhost.localdomain>
> Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
>
> On Tue, 26 May 2009, Jeff Layton wrote:
>
>> I haven't seen the cloud ready yet for anything other than
>> embarrassingly parallel codes (i.e. single node, small IO
>> requirements). Has anyone seen differently? (As an example of what
>> might work, CloudBurst seems to be gaining some traction - doing
>> sequencing in the cloud. The only problem is that sequencing can
>> generate a great deal of data pretty rapidly.)
>
> I'm pretty skeptical of commercial rent-a-cluster business models.

To me "Cloud" implies "embarrassingly parallel" infrastructure. High
speed interconnects (which I would loosely define as "better than std
GbE") are not cloud friendly... and really require a "Cluster for Hire"
arrangement. Also, there's no "throat to choke" in the clouds, but my
customers know how to find me to let me know when something isn't fair
or could be done better for their specific use case.

It's actually a great business when you have the right mix of clients
and expectations. IRS depreciation rules make it tough to survive...
but this is a double-edged sword, since depreciation rules drive some
of the bleeding edge users to external systems so they can stay
bleeding edge every year without internal accounting fights.

> Is this true in cluster computing?  Nearly every cluster computer
> user I know of wants "infinite capacity", not finite capacity.  They
> are limited by their budget, not their needs.  Give them more
> capacity, they'll scale up their computation and finish it faster or
> do it bigger or both.

This is where the "value" of a 2nd or 3rd year cluster really comes
through for "value" researchers... Would they rather spend their
money on a few systems for a long run, or a lot of systems for a short
run? Since many researchers are "project" based and funded, access to
a large system for short bursts of time could help a single researcher
work through more "projects" or ideas faster than on his own simple
cluster, even if the per-hour charge is higher. They are able to
accomplish more with their time and work through more brilliant
ideas. We can work out the "time value" of money, but what is the
"time value" of a brilliant mind waiting for an answer? This is the
reason Departmental or Enterprise class HPC systems *should* be the
minimum scale an organization builds in house. Some commercial ISVs
will donate temporary licenses to help test scaling limits
for their customers before a big purchase is made.
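
The "few systems for a long run vs. a lot of systems for a short run"
question can be put into rough numbers. The Python sketch below is only
an illustration: the job size, core counts and hourly rates are all
made-up assumptions, and it assumes perfect linear scaling, which real
codes rarely achieve.

```python
def run_profile(core_hours_needed, cores, rate_per_core_hour):
    """Return (wall_clock_hours, total_cost) assuming perfect linear scaling."""
    wall_hours = core_hours_needed / cores
    cost = core_hours_needed * rate_per_core_hour
    return wall_hours, cost


# Hypothetical project: 50,000 core-hours of work.
WORK = 50_000

# Hypothetical small in-house cluster: 32 cores, $0.05 per core-hour.
small_hours, small_cost = run_profile(WORK, cores=32, rate_per_core_hour=0.05)

# Hypothetical rented burst: 512 cores, $0.10 per core-hour.
burst_hours, burst_cost = run_profile(WORK, cores=512, rate_per_core_hour=0.10)

days_saved = (small_hours - burst_hours) / 24
extra_cost = burst_cost - small_cost

print(f"small cluster: {small_hours / 24:.1f} days, ${small_cost:,.0f}")
print(f"rented burst:  {burst_hours / 24:.1f} days, ${burst_cost:,.0f}")
print(f"extra cost per day of waiting saved: ${extra_cost / days_saved:,.2f}")
```

The last figure is the number to hold up against the "time value" of
the researcher: how much a day of not waiting is worth to them.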

> Plus edge cases -- somebody that needs a cluster desperately, but only
> for six months and to do one single computation (how common is that?
> not very, I think).

Six months is a long time. We routinely see requests for hundreds of
compute cores for less than a month. These users aren't "edge" or
"desperate"; they just understand the "inertia" of their internal
systems group, vendors etc., and "peak shaving" is the best way to keep
their in house system from being crushed under the weight of the
queues... while justifying longer term increases in capacity in
house. Outsourced computing for this shouldn't be thought of as "all
or nothing"; it's just an added tool for the Researcher and his IT
staff to manage when it fits their needs.
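
As a minimal sketch of what "peak shaving" means here: keep the in
house system full, and send only the overflow off site. The weekly
demand figures and the in-house capacity below are hypothetical, and a
real site would work from its scheduler's queue data instead.

```python
def peak_shave(weekly_demand, in_house_capacity):
    """Split each week's demand (core-hours) into (in_house, off_site) parts."""
    plan = []
    for demand in weekly_demand:
        local = min(demand, in_house_capacity)  # fill the local system first
        offsite = demand - local                # only the overflow goes out
        plan.append((local, offsite))
    return plan


# Hypothetical weekly demand in core-hours; in-house system delivers 10,000.
demand = [6_000, 9_000, 22_000, 14_000, 7_000]
plan = peak_shave(demand, in_house_capacity=10_000)

for week, (local, offsite) in enumerate(plan, start=1):
    print(f"week {week}: in-house {local:>6}, off-site {offsite:>6}")

total_offsite = sum(offsite for _, offsite in plan)
print(f"off-site total: {total_offsite} core-hours")
```

A steadily growing off-site total is also the kind of evidence that
helps justify a longer term increase in in-house capacity.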

The researcher who needs the extra cycles may not be the one who is
ultimately computing off site either... If they have peers on the
internal system who can more easily deal with the limits of network
bandwidth, then bartering can ensue and help everyone get more done
without blowing the budget or wasting a lot of wall clock time.
Ultimately FedEx is the highest bandwidth network, with terrible
latency. We have rarely had to resort to this "fall back" network so
far but are glad to know it is there and simple to manage.
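
The bandwidth vs. latency point is easy to check with back-of-the-
envelope arithmetic. In the Python sketch below the payload size, the
transit time and the WAN speed are assumptions chosen for illustration,
and the time to copy data on and off the disks at each end is ignored.

```python
def shipped_gbps(payload_terabytes, transit_hours):
    """Average throughput (Gbit/s) of a box of disks delivered after transit_hours."""
    bits = payload_terabytes * 1e12 * 8
    return bits / (transit_hours * 3600) / 1e9


def network_hours(payload_terabytes, link_gbps):
    """Hours needed to push the same payload over a link sustained at link_gbps."""
    bits = payload_terabytes * 1e12 * 8
    return bits / (link_gbps * 1e9) / 3600


# Assumptions: 10 TB shipped overnight (24 hours door to door),
# compared with a WAN link sustaining 100 Mbit/s (0.1 Gbit/s).
shipped = shipped_gbps(10, 24)
wan = network_hours(10, 0.1)

print(f"overnight box of disks: {shipped:.2f} Gbit/s average, 24 h latency")
print(f"100 Mbit/s WAN link:    {wan:.0f} hours for the same 10 TB")
```

Under these assumptions the courier wins on throughput by a wide
margin, but nothing at all arrives until the truck does.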

In the end, it's all about balancing the bleeding edge user's needs
with the value edge user's needs. If we can load the teeter totter up
with enough of each, much more research will be accomplished at a
better overall value for all the researchers. The business of selling
cycles on shared systems has been around since the first "virtual
machines" hit the mainframe market long ago. It's been called many
things, and funded many ways, but in the end scaling up to a specific
plateau is more efficient, assuming the machine stays reasonably loaded
over its useful life.
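
Why "reasonably loaded" matters so much can be seen from the cost of
each core-hour that actually gets used. The Python sketch below uses
invented purchase and operating costs for a hypothetical 256-core
machine with a three year life; only the shape of the result is the
point.

```python
HOURS_PER_YEAR = 8760


def cost_per_used_core_hour(purchase_cost, yearly_operating_cost,
                            cores, years, utilization):
    """Total cost of ownership divided by the core-hours actually used."""
    total_cost = purchase_cost + yearly_operating_cost * years
    used_core_hours = cores * years * HOURS_PER_YEAR * utilization
    return total_cost / used_core_hours


# Hypothetical machine: $100,000 to buy, $20,000 per year to run,
# 256 cores, three year useful life.
costs = {}
for utilization in (0.25, 0.50, 0.90):
    costs[utilization] = cost_per_used_core_hour(
        100_000, 20_000, cores=256, years=3, utilization=utilization)
    print(f"{utilization:.0%} loaded: ${costs[utilization]:.4f} per used core-hour")
```

Since the total cost is fixed, the cost per used core-hour is inversely
proportional to utilization: a machine that sits 25% loaded costs 3.6
times as much per useful hour as one that is 90% loaded.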

Researchers able to suffer occasionally long availability delays and  
"preemption" will find some really good deals backfilling our  
systems.  This way we can have our utilization and our availability  
(for unexpected jobs) too.  Some day it should be considered an  
ethical and ecological crime to have unused cycles wasted during the  
short life of a research computer.  It should be an implicit goal for  
every researcher who controls a system to keep it doing useful work  
for themselves or their peers.

Full Disclosure: I am a part owner of a "cluster for hire" business
trying to assemble the "wasted" cycles under so many desks to benefit
all the researchers who won't miss the whir and buzz under their
desks... Heck, we'll even put our nodes under their desk if that's
what they need us to do :)

Sincerely,
Greg W. Keller
R Systems NA, inc.
