[Beowulf] Southampton's RPi cluster is cool but too many cables?

Tue Sep 25 10:01:35 PDT 2012

On Tue, Sep 25, 2012 at 12:33:49PM -0400, Igor Kozin wrote:
>stored as characters (1 byte per char) the genome is ~ 3 GB. you could
>use two bits to represent the four letter alphabet but probably nobody
>does that. better yet, you can store only the difference against a
>known reference genome or use a dictionary based compression. however
>people like having metadata (e.g. quality) as well so the size varies.

Actually, the bioinformatics people do some/all of this.

The .2bit FASTA[1] format specifically compresses the ACGT data into
2 bits per base (T:00, C:01, A:10, G:11), plus some header/metadata
information.
Other formats such as 'VCF' specifically store variants against known
references[2].  The current *human* reference genome is several
gigabytes...compressed[0].
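
To make the 2-bit idea concrete, here is a rough Python sketch of the
packing, using the base-to-bits mapping given above.  The function names
(pack, unpack) are made up for illustration, and this is NOT the real
.2bit file layout: it ignores the header, N runs, and masking data that
the actual format carries.

```python
# Illustrative only: 2 bits per base, mapping T=00, C=01, A=10, G=11.
# The names CODE, BASE, pack and unpack are hypothetical helpers, not a real API.
CODE = {"T": 0b00, "C": 0b01, "A": 0b10, "G": 0b11}
BASE = "TCAG"  # index -> base, the inverse of CODE


def pack(seq: str) -> bytes:
    """Pack a string of A/C/G/T into 4 bases per byte, first base in the high bits."""
    out = bytearray((len(seq) + 3) // 4)
    for i, ch in enumerate(seq.upper()):
        shift = 6 - 2 * (i % 4)
        out[i // 4] |= CODE[ch] << shift  # KeyError on anything other than A/C/G/T
    return bytes(out)


def unpack(data: bytes, n: int) -> str:
    """Recover the first n bases from packed data."""
    return "".join(
        BASE[(data[i // 4] >> (6 - 2 * (i % 4))) & 0b11] for i in range(n)
    )


packed = pack("GATTACA")
print(len("GATTACA"), "bases ->", len(packed), "bytes")
print(unpack(packed, 7))
```

The base count has to be stored separately (here it is passed to unpack),
since the last byte may be only partly used.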

Some formats (e.g. 'SAM/BAM') have native compression, but it tends to
use existing algorithms within the file[3], instead of novel routines.
Files that do not have native compression are, in practice, often
compressed using gzip or LZO utilities (we've found bzip2 to be too slow
relative to the minor improvements in compression rates).
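
If you want to check the speed/size tradeoff on your own data, something
like the following sketch would do.  It uses Python's zlib module (DEFLATE,
the same algorithm gzip uses) and bz2; LZO is left out because it is not
in the standard library.  The compare() helper and the random test
sequence are assumptions for illustration only, and random ACGT is not
representative of real sequence files, so treat the numbers as
indicative at best.

```python
import bz2
import random
import time
import zlib


def compare(data: bytes) -> dict:
    """Return {name: (compressed_size, seconds)} for zlib and bz2 on data."""
    results = {}
    for name, compress, decompress in (
        ("zlib", zlib.compress, zlib.decompress),
        ("bz2", bz2.compress, bz2.decompress),
    ):
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        # Sanity check: the round trip must give back the original bytes.
        assert decompress(packed) == data
        results[name] = (len(packed), elapsed)
    return results


# Synthetic stand-in for sequence data (an assumption, not real reads).
random.seed(0)
sample = "".join(random.choice("ACGT") for _ in range(200_000)).encode("ascii")

for name, (size, secs) in compare(sample).items():
    print(f"{name}: {len(sample)} -> {size} bytes in {secs:.4f}s")
```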

Refs:
[0]	http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/
[1]	http://genome.ucsc.edu/FAQ/FAQformat.html#format7
[2]	http://www.1000genomes.org/node/101	
[3]	http://samtools.sourceforge.net/SAM1.pdf

>the current breed of sequencers which produce many short reads (70-200
>base pairs) with high coverage require lots of intermediate data
>flying around. indeed a single run can result in ~ 0.5 TB. if you have
>multiple sequencers you better have a lot of storage.

Or more.  We routinely have runs from a single sequencer that are over
1 TB in size (although the average is lower).  The analysis takes a
further 500 GB, although only about half of that is needed long-term
(as stored in .bam format).

>single-strand sequencing technologies promise to be more accurate and
>have long reads. still you don't want to wait a very long time for
>sequencing accurately a single strand so you chop it and do parallel
>processing. a laptop should be able to cope with post processing.

Maybe.  To process the data quickly--less than a day--you need a decent
IO system to read/write the data, and most laptops won't cut it.  The
actual CPU power for a single run isn't huge, but it's more than you
normally find in a laptop.
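
For a sense of scale, the sustained rate needed to move a dataset once
within a time window is just size divided by time.  A rough sketch in
Python, using decimal units and the sizes mentioned in this thread (the
function name is made up for illustration):

```python
def required_rate_mb_s(size_tb: float, hours: float) -> float:
    """Sustained MB/s needed to move size_tb terabytes once in the given hours.

    Uses decimal units: 1 TB = 1e12 bytes, 1 MB = 1e6 bytes.
    """
    return size_tb * 1e12 / (hours * 3600) / 1e6


# (size in TB, time window in hours), taken from figures quoted in this thread.
for size_tb, hours in ((1, 24), (2, 6), (30, 6)):
    rate = required_rate_mb_s(size_tb, hours)
    print(f"{size_tb} TB in {hours} h: about {rate:.0f} MB/s sustained")
```

This is a single pass only.  An analysis that reads and writes the data
several times needs correspondingly more, so these are lower bounds.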

>
>
>On 25 September 2012 16:26, Ellis H. Wilson III <ellis at cse.psu.edu> wrote:
>> On 09/25/2012 11:17 AM, Prentice Bisbal wrote:
>>> Where did you get that data-point from? I've been told a single genome
>>> sequence takes up about 6 GB of data, and I think that's after it's been
>>> processed.
>>>
>>> According to this article,  raw sequence can take up between 2-30 TB,
>>> and a processed one 1.5 GB. (Disclaimer: I only read the executive summary)
>>
>> At 2TB and a 6 hour lifespan for this thing to pull in the genome,
>> you're talking about around 90MB/s to process, and at 30TB now we're
>> talking about near to 1.4GB/s to process.  Since it states that the
>> laptop is doing the processing of the genome, I have serious doubts that
>> this thing could a) do the processing fast enough to boil the data down,
>> b) offload the data fast enough via USB or whatever it's connected with
>> (even gigabit ethernet is only barely fast enough) or c) have 2-30TB of
>> storage inside of it ;D.
>>
>> So even ignoring where the data points come from or the reality of the
>> size of a genome, I'm upper-bounding the data it actually pulls in to
>> something at most at like 500GB.  And that's still probably too optimistic.
>>
>> Very interested to hear more about the specifics of this thing, if
>> anyone can find a solid number on it.
>>
>> Best,
>>
>> ellis
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

-- 
Jesse Becker
NHGRI Linux support (Digicon Contractor)