[Beowulf] Station wagon full of tapes

Chris Dagdigian dag at sonsorol.org
Tue May 26 06:54:15 PDT 2009


I deal quite often with the "next-gen" DNA sequencing instruments
that produce 1 TB/day of TIFF images, which are then distilled down
to DNA basecalls before the short reads are subjected to alignment.
The resulting longer sequences are then usually aligned again against
a reference genome.

Lots of data, lots of computation.

The 1 terabyte of TIFF images typically reduces to about 200 GB of
intermediate data, which is further distilled down into a few hundred
KB of actual sequence data. The entire process is interesting, and it
is a massive Bio/IT challenge, as these terabyte-scale data-producing
lab instruments are popping up everywhere (the cost of one of these
instruments is now easily within reach of a single grant-funded
researcher at a facility of any size...). We are only a few
technology revolutions away from these boxes showing up in your
primary care physician's office (well, not really; probably a backend
service lab that your physician outsources to...)
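
To put 1 TB/day in perspective, here is a back-of-envelope sketch (in
Python; the link speeds are illustrative assumptions, not measurements
of any particular campus) of how long it takes just to move one day of
raw output over common network links:

    # Back-of-envelope: hours to move 1 TB over common network links.
    # Nominal link speeds; real-world throughput is usually lower.

    TB_BITS = 8 * 10**12  # one terabyte, in bits

    links_mbps = {
        "T1 (1.5 Mb/s)": 1.5,
        "Fast Ethernet (100 Mb/s)": 100,
        "GigE (1 Gb/s)": 1000,
    }

    for name, mbps in links_mbps.items():
        hours = TB_BITS / (mbps * 10**6) / 3600
        print(f"{name}: {hours:,.1f} hours per TB")

At a sustained full GigE rate (~2.2 hours/TB) you can just barely keep
ahead of one instrument; over the kind of anemic wet-lab uplink I
mention below (~22 hours/TB on Fast Ethernet, weeks on anything
slower), you cannot. That is the whole "station wagon full of tapes"
point.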

Anyway, the new data ingestion service that Amazon offers (AWS
Import/Export, where you ship them physical drives) is, I think,
going to be a big deal in our field.

For the following reasons:

- Bio people are being buried in data
- Once we process the data to get the derived results, the primary  
data just needs to go somewhere cheap
- Amazon and other internet-scale people can do peta-scale or exa- 
scale storage far better & cheaper than any of my customers
- These instruments are popping up in wet labs across campus with weak/ 
anemic network links to IT core facilities and data centers
- Scientists are in many cases required to share grant-funded data
- Amazon has some neat "downloader pays" models that make it easier
for researchers to affordably offer up peta-scale data sets for
sharing (a sketch of enabling this follows the list)
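
On the "downloader pays" point: the S3 feature is called Requester
Pays, and flipping it on for a sharing bucket is a one-call operation.
A minimal sketch using the boto3 SDK (the bucket name is hypothetical,
and AWS credentials are assumed to be configured):

    # Minimal sketch: enable S3 Requester Pays so people downloading a
    # shared data set pay the transfer costs, not the grant.
    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_request_payment(
        Bucket="my-lab-sequencing-data",  # hypothetical bucket name
        RequestPaymentConfiguration={"Payer": "Requester"},
    )

Downloaders then acknowledge the charge on each request (in boto3,
RequestPayer="requester" on get_object) and the transfer bill lands on
their account, not yours.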

I suspect that very large amounts of scientific data will be making a
one-way trip into the cloud. The data will stay there "forever" as a
deep store. In the occasional cases where the data needs to be
re-processed or re-analyzed, it would not be unreasonable to fire up
some cloud server nodes to do the re-work in situ.
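
A minimal sketch of that "fire up some nodes" step with boto3; the AMI
ID, instance type, and node count are all illustrative assumptions
(c5-class instances being a modern stand-in for whatever EC2 offers at
the time):

    # Minimal sketch: launch a few EC2 nodes from an AMI preloaded with
    # the analysis pipeline, do the re-work next to the data, tear down.
    import boto3

    ec2 = boto3.client("ec2")
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # hypothetical preloaded AMI
        InstanceType="c5.4xlarge",        # assumption: CPU-bound alignment
        MinCount=8,
        MaxCount=8,
    )
    node_ids = [i["InstanceId"] for i in resp["Instances"]]

    # ... run the re-analysis jobs on the nodes ...

    ec2.terminate_instances(InstanceIds=node_ids)  # stop paying for them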

The disk ingest service was the final piece. I can see this happening  
in life science environments:

- Massive data generated in the wet lab
- Captured to local storage (10 - 40TB) with small HPC component
- Data is processed locally into derived and distilled forms
- Derived data replicated to campus/lab facilities for online primary  
storage
- Derived data (and possibly the full raw data) is compressed, placed
onto drives, and ingested into Amazon for long-term storage (a sketch
of this drive-prep step follows the list)
- If re-analysis is ever needed, launch existing EC2 AMIs preloaded
with the necessary software (as in the sketch above)
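
As promised above, a minimal sketch of the drive-prep step: compress a
run directory onto the shippable drive and record a checksum so the
ingested copy can be verified later. The paths are hypothetical, and
the actual AWS ingest manifest is not shown:

    # Minimal sketch: compress one instrument run onto the ship drive
    # and write an MD5 next to it for post-ingest verification.
    import hashlib
    import tarfile

    run_dir = "/data/run_2009_05_26"                    # hypothetical
    archive = "/mnt/ship_drive/run_2009_05_26.tar.gz"   # hypothetical

    with tarfile.open(archive, "w:gz") as tar:
        tar.add(run_dir, arcname="run_2009_05_26")

    md5 = hashlib.md5()
    with open(archive, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)

    with open(archive + ".md5", "w") as out:
        out.write(md5.hexdigest() + "\n")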

Basically it comes down to the fact that Amazon may be able to offer
big-yet-slow storage in the terabyte-to-petabyte range at levels of
cost and geographical redundancy that would be extremely difficult to
match with local resources at a small, non-specialized organization.

My $.02, of course.

-Chris






On May 26, 2009, at 8:58 AM, Jeff Layton wrote:

> Gerry Creager wrote:
>> There was an interesting brainstorming session at Rocks-A-Palooza a  
>> couple of weeks ago.  Someone wants to offer Amazon resources.   
>> Problem remains for me: How can I get sufficient cloud resources  
>> for computing (I'll hammer on dataset transport in a moment) that  
>> will handle reasonable weather models with their small message MPI  
>> chatter, and lots of file I/O?  I've been assured that Amazon's  
>> ready to accommodate that.
>
> This is one of the problems - clouds aren't ready for this kind of
> usage model yet. They only have GigE and usually it's oversubscribed.
> When you say file IO, they hear capacity, not performance (either
> throughput or IOPS). And as you point out, the pipe to/from the
> cloud is not ready for lots of data.
>
>> However, getting data into S3 for availability, when a daily multi- 
>> gigabyte dataset is used for initiation, and another is created as  
>> output, is going to be expensive, and likely slow.  I think there  
>> are other approaches that have to be evaluated.  I am not sure the  
>> cloud is ready for MPI play on a significant basis, just yet.
>
> I haven't seen the cloud ready yet for anything other than
> embarrassingly parallel codes (i.e. single-node, small IO
> requirements). Has anyone seen differently? (As an example of what
> might work, CloudBurst seems to be gaining some traction - doing
> sequencing in the cloud. The only problem is that sequencing can
> generate a great deal of data pretty rapidly.)
>
> Jeff



