[Beowulf] Station wagon full of tapes

Tue May 26 09:19:11 PDT 2009

Robert G. Brown wrote:

> Sure, but why wouldn't it be cheaper for e.g. NSF or NIH to fund an
> exact clone of the service Amazon plans to offer and provide it for free
> to its supported research groups (or rather, do bookkeeping but it is
> all internal bookkeeping, moving money from one pocket to another).

TANSTAAFL.  Someone, somewhere has to pay.  And to make this thing 
useful you need a ton of bandwidth, and you need it cheap.

Bandwidth is not cheap.

> Amazon has to make a profit.  Granting agencies don't have to pay the
> profit that Amazon has to make.  Amazon has to take substantial risks to
> make its profit.  Granting agencies have no risk.

No ... they have technological and organizational risks.  Technological:
will the thing work?  Organizational:  who is calling the shots, and
how many political battles must be won to get the solution done?

> All of the things you assert for DNA sequencing are true for high energy
> physics.  Enormous datasets, lots of computation.  HEP's INTERNATIONAL
> solution is ATLAS, not Amazon.

Yup.

> Supporting commercial access into such a DB a la >>google<< but for
> genomic data, sure, but that's not really cluster computing, that's a
> large shared DB.  I could see that as a spin off data service of Amazon
> or Google or a new business altogether, but I'd view it as a niche and
> not really HPC.

Well ... it is cluster computing, but not as most participants here 
know/understand it.

There is a huge amount of processing associated with this data.  It's at 
minimum O(N) and often O(N x M) for some large M.  This processing is 
invariably integer-based.

But the issue is that, for this research, the vast majority of end users 
(wet lab, etc) are IO and data motion bound.  I see this getting worse 
over time, not better.

While there is a strong push to try to do these things at Amazon and 
other locations, the cost-benefit analysis for doing many jobs with an 
ever-increasing bolus of data does in fact favor the small local HPC 
system.

> Grant funded research involving large scale shared data resources can
> ALWAYS be done more cheaply than by buying the data services from
> profit-making third parties unless there are nonlinear e.g. proprietary
> IP barriers.  This is trebly true given that research facilities are

Not just less expensive, but more bandwidth.  Campus bandwidths often 
completely dwarf what you can get out of the campus, even if you can get 
Internet2 access.  UMich has dedicated fibre pulls (at least they did in 
2002 when I worked with them) for gigabit between buildings. 
Expensive, but they needed it.

The name of the game is IO rates and data motion rates between data 
sinks/sources.  Call processing effectively infinitely fast for these 
users.  Moving the data isn't.
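
A rough back-of-the-envelope sketch of the data motion point, in Python. 
The dataset size, the off-campus link rate, and the function name are 
illustrative assumptions, not measurements; only the gigabit campus 
figure comes from the discussion above.

```python
def transfer_seconds(data_bytes, link_bits_per_second, efficiency=1.0):
    """Seconds needed to move data_bytes over a link.

    efficiency is the fraction of the nominal line rate actually
    achieved (1.0 = ideal; real links usually deliver less).
    """
    return data_bytes * 8 / (link_bits_per_second * efficiency)


TB = 10**12                 # decimal terabyte, in bytes

dataset_bytes = 1 * TB      # hypothetical size of one data set
campus_link = 10**9         # gigabit between buildings
offsite_link = 10**8        # assumed 100 Mb/s path off campus

campus_s = transfer_seconds(dataset_bytes, campus_link)
offsite_s = transfer_seconds(dataset_bytes, offsite_link)

print(f"campus : {campus_s / 3600:.1f} hours")
print(f"offsite: {offsite_s / 3600:.1f} hours")
print(f"ratio  : {offsite_s / campus_s:.0f}x")
```

Plug in your own link rates and data sizes.  Even at ideal line rate 
the move takes hours per terabyte, which is why the processing time 
can be treated as negligible by comparison.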

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615