[Beowulf] Southampton's RPi cluster is cool but too many cables?

Igor Kozin i.n.kozin at googlemail.com
Tue Sep 25 09:33:49 PDT 2012


"this thing" does only ~ 1/20 of the genome. you have to pay quite a
bit more for your full genome which makes it comparable (price-wise)
with other technologies. hopefully in a few years time it'll get
cheaper.
Stored as characters (1 byte per char), the genome is ~3 GB. You could
use two bits to represent the four-letter alphabet, but probably nobody
does that. Better yet, you can store only the differences against a
known reference genome, or use dictionary-based compression. However,
people like having metadata (e.g. quality scores) as well, so the size varies.
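
For illustration, a minimal sketch of the two-bit packing idea in
Python (the encoding table and helper are hypothetical, not what any
real pipeline actually uses):

import math

# Hypothetical 2-bit code for the four bases (assumes no Ns or ambiguity codes).
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack(seq):
    """Pack a DNA string into bytes, four bases per byte."""
    out = bytearray(math.ceil(len(seq) / 4))
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)

print(len(pack("ACGTACGT")))  # 2 bytes instead of 8; ~750 MB for a 3 Gbp genome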

The current breed of sequencers, which produce many short reads (70-200
base pairs) with high coverage, requires lots of intermediate data
flying around. Indeed, a single run can result in ~0.5 TB. If you have
multiple sequencers, you had better have a lot of storage.

Single-strand sequencing technologies promise to be more accurate and
to have long reads. Still, you don't want to wait a very long time to
sequence a single strand accurately, so you chop it up and process the
pieces in parallel. A laptop should be able to cope with the post-processing.


On 25 September 2012 16:26, Ellis H. Wilson III <ellis at cse.psu.edu> wrote:
> On 09/25/2012 11:17 AM, Prentice Bisbal wrote:
>> Where did you get that data-point from? I've been told a single genome
>> sequence takes up about 6 GB of data, and I think that's after it's been
>> processed.
>>
>> According to this article, a raw sequence can take up between 2 and 30 TB,
>> and a processed one about 1.5 GB. (Disclaimer: I only read the executive summary)
>
> At 2 TB and a 6-hour lifespan for this thing to pull in the genome,
> you're talking about around 90 MB/s to process, and at 30 TB we're
> talking about nearly 1.4 GB/s to process.  Since it states that the
> laptop is doing the processing of the genome, I have serious doubts that
> this thing could a) do the processing fast enough to boil the data down,
> b) offload the data fast enough via USB or whatever it's connected with
> (even gigabit ethernet is only barely fast enough), or c) have 2-30 TB of
> storage inside of it ;D.
>
> So even ignoring where the data points come from or the reality of the
> size of a genome, I'm upper-bounding the data it actually pulls in to
> something around 500 GB at most.  And that's still probably too optimistic.
>
> Very interested to hear more about the specifics of this thing, if
> anyone can find a solid number on it.
>
> Best,
>
> ellis
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
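
As a quick back-of-envelope check of the rates quoted above (a sketch,
assuming the 6-hour window):

# Sanity check of the throughput figures in the quoted message.
seconds = 6 * 3600
for size_tb in (2, 30):
    rate = size_tb * 1e12 / seconds      # bytes per second
    print(f"{size_tb} TB over 6 h ~ {rate / 1e6:.0f} MB/s")
# 2 TB  -> ~93 MB/s
# 30 TB -> ~1389 MB/s, i.e. ~1.4 GB/s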
